Research on an LDA Short-Text Classification Algorithm Based on the Hadoop Platform
[Abstract]: In recent years, the growth of instant messaging, Weibo, and other network applications has produced large volumes of short text. This data is both large in volume and fast-growing, and how to make rational use of it and extract valuable information from it has become an active research topic. Short-text research is widely applied in online public opinion analysis, hot topic discovery, social networks, shopping platform recommendation, and information security. Short texts are brief, feature-sparse, and noisy, so traditional text classification methods perform poorly on them. Building on previous research, this paper proposes a short text classification method based on co-occurrence relations among LDA topics. The Latent Dirichlet Allocation (LDA) topic model is applied to the short texts to obtain the topic-word distribution, and words that appear in multiple topics at the same time are extracted to form a co-occurrence word set. The degree of correlation between each word in the co-occurrence word set and each topic is then computed, and words that have approximately equal correlation with more than two topics are screened out to form a confused word set. During classification, the weights of the words in the confused word set are reduced so that they have less influence on the classification result. To improve the efficiency of the method, this paper combines it with the Hadoop platform and uses the strengths of Hadoop's distributed architecture in processing massive data to speed up classification. The experiments use two corpora: a news headline corpus and a Weibo corpus.
Two experimental schemes are used. First, the feasibility of the algorithm is verified on the smaller news headline corpus, and its classification performance is compared with that of other methods. Second, the large Weibo corpus is used to test whether the proposed method achieves a significant improvement in classification efficiency on the Hadoop platform. Analysis of the experimental results leads to the conclusion that both the proposed co-occurrence-based LDA short text classification method and the efficiency gained by combining it with the Hadoop platform achieve the desired goals.
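The word-screening steps described in the abstract (co-occurrence word set, confused word set, weight reduction) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the topic-word distributions in `PHI`, the top-`n` cutoff, the tolerance `tol`, the `penalty` factor, and all function names are assumptions made for illustration. In particular, the sketch measures a word's correlation with a topic as its topic probability normalized across topics, and it treats a word as confused when its two highest correlations are nearly equal; the abstract does not specify either choice.

```python
from collections import Counter

# Hypothetical topic-word distributions, as an LDA model might produce them.
# Each dict maps word -> probability of that word under one topic.
PHI = [
    {"match": 0.35, "team": 0.25, "win": 0.20, "market": 0.15, "stock": 0.05},
    {"market": 0.35, "stock": 0.30, "win": 0.20, "team": 0.10, "match": 0.05},
]


def top_words(phi, n):
    """Return, for each topic, the set of its n highest-probability words."""
    return [set(sorted(dist, key=dist.get, reverse=True)[:n]) for dist in phi]


def co_occurring_words(phi, n):
    """Words that appear among the top-n words of at least two topics."""
    counts = Counter(w for words in top_words(phi, n) for w in words)
    return {w for w, c in counts.items() if c >= 2}


def confused_words(phi, candidates, tol):
    """Candidates whose two strongest topic correlations differ by at most tol.

    Correlation is assumed here to be the word's probability under a topic,
    normalized over all topics.
    """
    confused = set()
    for w in candidates:
        scores = [dist.get(w, 0.0) for dist in phi]
        total = sum(scores)
        if total == 0 or len(scores) < 2:
            continue
        rel = sorted((s / total for s in scores), reverse=True)
        if rel[0] - rel[1] <= tol:
            confused.add(w)
    return confused


def word_weights(vocab, confused, penalty):
    """Give confused words a reduced weight (penalty < 1), all others 1.0."""
    return {w: (penalty if w in confused else 1.0) for w in vocab}


CO_WORDS = co_occurring_words(PHI, n=4)
CONFUSED = confused_words(PHI, CO_WORDS, tol=0.1)
VOCAB = set().union(*PHI)
WEIGHTS = word_weights(VOCAB, CONFUSED, penalty=0.5)

print(sorted(CO_WORDS))
print(sorted(CONFUSED))
```

In this toy data, "team", "win", and "market" are shared between the two topics' top words, but only "win" is equally probable under both, so only its weight is reduced. A classifier would then multiply each word's feature value by its entry in `WEIGHTS` before scoring a text.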
【Degree-granting institution】: Tianjin University of Finance and Economics
【Degree level】: Master's
【Year degree granted】: 2016
【Classification number】: TP391.1