Research on an LDA Short-Text Classification Algorithm Based on the Hadoop Platform

Published: 2019-03-04 13:13

[Abstract]: In recent years, with the growth of instant messaging, Weibo (microblogging), and other network applications, large volumes of short text have emerged. This data is both fast-growing and large in volume. How to make good use of massive text data and extract valuable information from it has become a current research focus. Research on short text is widely applied in fields such as online public opinion analysis, hot topic discovery, social networks, shopping platform recommendation, and information security. Short texts are brief, have sparse features, and contain a lot of noise, so traditional text classification methods perform poorly on them. Building on previous work, this thesis proposes a short-text classification method based on co-occurrence relations among LDA topics. The Latent Dirichlet Allocation (LDA) topic model is applied to the short texts to obtain the topic-word distribution; words that appear in several topics at the same time are then extracted to form a co-occurrence word set; next, the relevance of each word in the co-occurrence word set to each topic is computed, and words with approximately equal relevance to two or more topics are further filtered out to form a confusable word set. During classification, the weights of words in the confusable word set are reduced to lessen their influence on the classification result. To improve running efficiency, the method is combined with the Hadoop platform, using the strengths of the Hadoop distributed system in processing massive data to speed up classification. The experiments use two corpora: a news headline corpus and a Weibo corpus. Two experimental schemes are defined: first, the smaller news headline corpus is used to verify the feasibility of the algorithm and, by comparison with other methods, its advantage in classification quality; then the large Weibo corpus is used on the Hadoop platform to test whether the method achieves a significant improvement in classification efficiency. Analysis of the experimental results shows that both the proposed co-occurrence-based LDA short-text classification method and its combination with the Hadoop platform meet the expected goals in classification quality and efficiency.
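The abstract does not give the exact formulas, so the following Python sketch only illustrates the pipeline it describes: co-occurrence word set, then confusable word set, then weight reduction. Everything specific here is an assumption made for illustration and not taken from the thesis: the toy topic-word probabilities, the "top-n words per topic" rule for co-occurrence, the normalised-probability relevance measure, the `tolerance` threshold, and the `penalty` factor. The LDA step itself is assumed to have already produced the topic-word distributions.

```python
from typing import Dict, List, Set

# Hypothetical output of an LDA run: topic_word[t][w] = P(w | topic t).
# The values are made up for illustration only.
TopicWord = List[Dict[str, float]]


def cooccurring_words(topic_word: TopicWord, top_n: int) -> Set[str]:
    """Words that rank among the top_n words of at least two topics
    (an assumed definition of "appearing in several topics")."""
    counts: Dict[str, int] = {}
    for dist in topic_word:
        top = sorted(dist, key=dist.get, reverse=True)[:top_n]
        for word in top:
            counts[word] = counts.get(word, 0) + 1
    return {word for word, c in counts.items() if c >= 2}


def relevance(topic_word: TopicWord, word: str) -> List[float]:
    """Assumed relevance measure: P(word | topic) normalised over topics."""
    raw = [dist.get(word, 0.0) for dist in topic_word]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw


def confused_words(topic_word: TopicWord, top_n: int, tolerance: float) -> Set[str]:
    """Co-occurring words whose two highest relevances differ by at most
    `tolerance`, i.e. words that do not clearly favour a single topic."""
    confused: Set[str] = set()
    for word in cooccurring_words(topic_word, top_n):
        rel = sorted(relevance(topic_word, word), reverse=True)
        if rel[0] - rel[1] <= tolerance:
            confused.add(word)
    return confused


def reweight(term_weights: Dict[str, float], confused: Set[str],
             penalty: float) -> Dict[str, float]:
    """Scale down the weight of confusable words before classification."""
    return {w: (v * penalty if w in confused else v)
            for w, v in term_weights.items()}


if __name__ == "__main__":
    topics: TopicWord = [
        {"match": 0.30, "team": 0.25, "win": 0.20, "market": 0.05},
        {"market": 0.30, "stock": 0.25, "win": 0.21, "team": 0.02},
    ]
    confusable = confused_words(topics, top_n=3, tolerance=0.1)
    print(confusable)
    print(reweight({"win": 1.0, "team": 1.0}, confusable, penalty=0.5))
```

In this toy data, "win" ranks highly in both topics with nearly equal probability, so it is the kind of word the method would down-weight. On a cluster, the per-document reweighting step could in principle run as independent map tasks, but the abstract does not describe how the thesis partitions the work on Hadoop.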
[Degree-granting institution]: Tianjin University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1


Article ID: 2434294


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2434294.html
