Hadoop-Based Weibo Text Classification and Commercial Keyword Extraction
[Abstract]: With the rapid development of computer and network technology, Weibo has become a new form of media in China. As its user base has expanded rapidly and information has spread ever more widely, Weibo data has grown to an unprecedented scale. Given the massive volume of text that Weibo produces, accurately and efficiently finding information of high commercial value in it, and thereby improving the precision of advertising, has become a major goal of data processing on the Weibo platform. This thesis studies how to find and extract commercial keywords effectively from massive Weibo text data. To make the extraction more targeted, the Weibo data is first classified by text category. This reduces the amount of data handled in a single pass and, because data of the same category is processed together, makes the processing more targeted. The weights of the keywords in each category of Chinese text are then adjusted using each word's search weight in Internet search engines, which improves the accuracy of commercial keyword extraction from Weibo text. Because Weibo text is voluminous, short, and random in content, traditional classification methods and commercial information extraction algorithms have limitations when applied to it. Since a single Weibo post contains few effective features and is colloquial in style, this thesis expands the feature words of each text with similar words and collocated words to reduce feature loss as far as possible. To address the large volume and random content of Weibo text, it proposes a new Weibo text classification method based on the dispersion of feature words across categories.
Taking into account Weibo-specific factors such as repost counts, comment counts, and the massive scale of the data, this thesis improves the traditional TF-IDF algorithm. Running on the Hadoop cloud computing platform and treating all of an individual user's Weibo posts as one computing unit, it applies the improved TF-IDF algorithm and then adjusts the resulting weights with each word's search weight in Internet search engines, so that keywords of commercial value can be extracted effectively from massive data. Experiments show that the proposed classification method achieves good results on Weibo posts. In Weibo data processing and application scenarios, the improved commercial keyword extraction algorithm, which combines TF-IDF weights with Internet search weights, shows good applicability and commercial value. Combined with the cloud computing platform, it improves data processing efficiency to some extent, which makes processing massive Weibo data sets feasible and effective.
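The abstract does not give the improved formula, so the following is only a minimal single-process sketch of the general idea under stated assumptions: term counts are weighted by a post's reposts and comments, all of one user's posts form one TF-IDF document, and the result is scaled by an externally supplied search weight. The function names, the logarithmic engagement damping, the smoothed IDF, and the `alpha` blending factor are all hypothetical choices for illustration, not the thesis's actual method. On Hadoop, the per-user grouping would naturally become the key of a MapReduce job.

```python
import math
from collections import defaultdict


def engagement_weight(reposts, comments):
    # Hypothetical damping: log scale so a viral post does not dominate.
    return 1.0 + math.log1p(reposts + comments)


def weighted_tfidf(users):
    """users maps user_id -> list of (tokens, reposts, comments).

    Each user's posts are merged into one document (the computing unit).
    """
    tf = {}
    df = defaultdict(int)
    for uid, posts in users.items():
        counts = defaultdict(float)
        for tokens, reposts, comments in posts:
            w = engagement_weight(reposts, comments)
            for token in tokens:
                counts[token] += w
        total = sum(counts.values()) or 1.0
        tf[uid] = {t: c / total for t, c in counts.items()}
        for t in counts:
            df[t] += 1  # number of users whose posts contain the term
    n = len(users)
    # Smoothed IDF (an assumption; the thesis may use another variant).
    return {
        uid: {t: v * (math.log((1 + n) / (1 + df[t])) + 1.0)
              for t, v in terms.items()}
        for uid, terms in tf.items()
    }


def top_keywords(scores, search_weight, k=2, alpha=1.0):
    # Hypothetical blend: scale each TF-IDF score by (1 + alpha * search weight).
    adjusted = {t: s * (1.0 + alpha * search_weight.get(t, 0.0))
                for t, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)[:k]


users = {
    "u1": [(["phone", "sale", "phone"], 10, 5), (["weather"], 0, 0)],
    "u2": [(["weather", "lunch"], 0, 0)],
}
scores = weighted_tfidf(users)
keywords_u1 = top_keywords(scores["u1"], {"phone": 0.9, "sale": 0.2})
print(keywords_u1)
```

In this toy data, terms from the heavily reposted post outrank "weather", which appears in both users' posts and only in low-engagement ones.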
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP393.092; TP391.1