Research on Parallel Text Classification Algorithms Based on Random Forest and Spark

Published: 2018-11-25 09:04

[Abstract]: Text classification is widely used in applications such as search engines and information retrieval. With the rapid spread of information technology, classifying text in big data effectively has become an important topic in data mining research. This thesis studies the application of the random forest algorithm to large-scale text classification. Random forest is an ensemble algorithm that handles massive data effectively; by introducing randomness, it achieves good classification results while largely avoiding the overfitting problem of single decision trees. However, when sampling to build decision trees, random forest may generate poor random subspaces, which leaves the corresponding decision trees with weak classification ability. To address this, the thesis adopts a random forest algorithm based on rough set theory to adjust the classification ability of these trees. In addition, a weighted voting method is used, with weights determined by the classification ability of each decision tree in the forest. Experiments show that on most data sets the rough-set-based random forest outperforms KNN, naive Bayes, decision trees, and the traditional random forest. MapReduce is currently the most widely used parallel computing framework for big data, and parallel text classification algorithms under MapReduce have received considerable attention. A drawback of MapReduce is that intermediate results are stored on HDFS during parallel computation, which incurs substantial I/O overhead. Spark, by contrast, is a parallel framework based on in-memory computing: it does not write intermediate results directly to disk during execution (data is partially cached to disk only when memory is insufficient), so its execution efficiency is comparatively good. The thesis studies the application of random forest and Spark to large-scale text classification and briefly compares the result with parallel text classification under MapReduce. Experiments show that parallel text classification on Spark achieves good parallel performance and outperforms its MapReduce counterpart. Finally, to make the cluster easier to use, a parallel text classification system based on a B/S (browser/server) architecture is designed for remote job submission, cluster monitoring, and data download.
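The weighted voting step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the weights are computed, so the sketch simply assumes each tree already carries a non-negative weight (for example, its accuracy on held-out data). The function name `weighted_vote` and the sample labels are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-tree predictions into one label by weighted voting.

    predictions: one predicted label per tree.
    weights: one non-negative weight per tree (assumed here to reflect
             the tree's classification ability, e.g. held-out accuracy).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree")
    if not predictions:
        raise ValueError("need at least one tree")

    # Sum the weights of all trees that voted for each label.
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # Pick the label with the largest total weight; on a tie, max()
    # returns the tied label that was first seen in tree order.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # One strong tree outvotes two weak trees, unlike plain majority voting.
    tree_predictions = ["sports", "finance", "finance"]
    tree_weights = [0.9, 0.4, 0.3]
    print(weighted_vote(tree_predictions, tree_weights))
```

With equal weights this reduces to ordinary majority voting, so the weighting only changes the outcome when the trees differ in estimated quality.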
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1

Article ID: 2355549

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2355549.html
