Hadoop-Based Weibo Text Classification and Commercial Keyword Extraction

Published: 2019-02-19 18:49

[Abstract]: With the rapid development of computer and network technology, Weibo (microblogging) has become a major new medium in China. The rapid growth of its user base, together with the level-by-level spread of information, has pushed the volume of Weibo data to an unprecedented scale. Given the massive text data this produces, a major goal of data research on Weibo platforms is to accurately and efficiently find and obtain information of high commercial value, and thereby improve advertising precision. This thesis studies how to effectively discover and extract commercial keywords from massive Weibo text data.

To make commercial keyword extraction more targeted, the massive Weibo data is first classified by text category. This reduces the amount of data handled in a single pass and lets the subsequent processing focus on data of the same category. The weights of the keywords in each category are then adjusted using the terms' search weights in Internet search engines, which effectively improves the precision of commercial keyword extraction from Weibo text.

Weibo text is large in total volume, short per post, and highly informal in content, so traditional classification methods and commercial information extraction algorithms have limitations when applied to it. Because a single post is short, contains few effective features, and is colloquial, this thesis expands the feature words of each text with similar words and collocated words to reduce the chance of losing features. Given the large number of posts and the informality of their content, it proposes a Weibo text classification method based on the category dispersion of feature words and the degree of that dispersion.

Taking into account Weibo-specific factors such as repost counts, comment counts, and massive scale, the thesis modifies the traditional TF-IDF algorithm. Using the Hadoop cloud computing platform, it applies the improved TF-IDF algorithm with all posts of a single user as the unit of computation, then adjusts the result with each word's search weight in Internet search engines, achieving effective extraction of commercially valuable keywords from massive data.

Experiments show that the classification method achieves good results on Weibo posts, and that, in Weibo data processing scenarios, the commercial keyword extraction algorithm combining the improved TF-IDF weight with the Internet search weight has good applicability and commercial effect. Running on the cloud computing platform improves data processing efficiency to some extent, making it feasible and effective to process massive Weibo data sets.
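
The abstract does not give the modified TF-IDF formula, so the following is only a minimal sketch of the general idea, not the thesis's actual method. It treats all posts of one user as a single document, weights each term occurrence by an assumed engagement boost of `1 + log(1 + reposts + comments)`, uses a smoothed IDF, and multiplies by a hypothetical `search_weight` lookup standing in for the search-engine weight. The function names and the in-process "shuffle" that mimics a Hadoop map and reduce pass are likewise illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def map_user_posts(user_id, posts):
    """Map step: emit (term, (user_id, tf)) with one user as the document.

    posts is a list of (tokens, reposts, comments) tuples.
    """
    weighted = Counter()
    for tokens, reposts, comments in posts:
        # Assumed boost: terms in widely reposted/commented posts count more.
        boost = 1.0 + math.log1p(reposts + comments)
        for term in tokens:
            weighted[term] += boost
    total = sum(weighted.values())
    for term, count in weighted.items():
        yield term, (user_id, count / total)


def reduce_term(term, values, num_users, search_weight):
    """Reduce step: combine per-user TF values for one term into scores."""
    values = list(values)
    # Smoothed IDF over users (document frequency = number of users).
    idf = math.log((1 + num_users) / (1 + len(values))) + 1.0
    # Hypothetical search-engine weight; 1.0 means "no adjustment".
    adjust = search_weight.get(term, 1.0)
    for user_id, tf in values:
        yield user_id, term, tf * idf * adjust


def extract_keywords(users, search_weight, top_k=3):
    """Run the map and reduce steps in-process and keep top_k terms per user."""
    grouped = defaultdict(list)  # stands in for the Hadoop shuffle phase
    for user_id, posts in users.items():
        for term, value in map_user_posts(user_id, posts):
            grouped[term].append(value)

    scores = defaultdict(list)
    for term, values in grouped.items():
        for user_id, t, score in reduce_term(term, values, len(users), search_weight):
            scores[user_id].append((score, t))

    return {
        user_id: [t for _, t in sorted(scored, reverse=True)[:top_k]]
        for user_id, scored in scores.items()
    }


if __name__ == "__main__":
    sample_users = {
        "u1": [(["phone", "sale", "today"], 10, 5), (["phone", "weather"], 0, 0)],
        "u2": [(["weather", "today"], 0, 0)],
    }
    print(extract_keywords(sample_users, {"phone": 2.0, "sale": 1.5}, top_k=2))
```

Keeping the map and reduce steps as separate generator functions means the same logic could in principle be moved into a real Hadoop job, where the framework would perform the grouping that the `grouped` dictionary does here.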
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092; TP391.1


Article ID: 2426761


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2426761.html
