Design and Implementation of an Enterprise News Information Classification Subsystem in a Distributed Environment

Published: 2018-08-27 09:03

[Abstract]: In recent years, with the rapid development of the Internet, the volume of news has grown continuously, and news information plays an increasingly important role in people's cultural and daily lives. How to collect and organize large amounts of news data, and how to surface the news that people actually want to find, is the main problem studied in this thesis. Common search engines return too many news results, many of them only weakly related to the topic. To address this, the thesis proposes and designs an enterprise-oriented news classification subsystem that provides news collection, information processing, and news display, so that enterprise users can quickly and accurately obtain news related to their industry. First, a web crawler module is designed. The crawler is written using a breadth-first algorithm and efficiently collects and identifies news of interest to the enterprise. Second, a text classification module is designed and implemented, in which a distributed Bayesian algorithm classifies news texts. During classification, text preprocessing, feature selection, and vectorization require heavy computation, and model training suffers from long training times and limited database storage capacity. To solve these problems, a Hadoop distributed computing platform is built, the MapReduce parallel computing model is used to process the different stages of text classification in a distributed and parallel fashion, and a Hive data warehouse is established to handle the large storage footprint. When a large amount of new data arrives, the traditional Bayesian method must relearn all previous samples, which is both time-consuming and cumbersome. To handle this case, the thesis draws on conventional incremental learning methods and designs and implements an incremental Bayesian algorithm, which does not retrain on the data and only needs to correct the existing data. Finally, a classification subsystem for enterprise news information is designed, comprising information collection, text preprocessing, feature extraction, classifier construction, classification performance evaluation, and incremental learning, and the functions of several of its modules are tested. The system obtains news with the crawler and classifies it in a Hadoop environment. Tests show that, on large-scale news data, the incremental classifier on Hadoop achieves about 4% higher accuracy than the traditional Bayesian classifier, with good execution efficiency and high scalability. The thesis gives an implementation scheme for online news text classification that can serve as a reference for text classification in other fields.
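The abstract does not spell out the incremental Bayesian algorithm, so the following is only an illustrative sketch of the general idea it describes: a multinomial naive Bayes classifier that keeps nothing but counts, so a new batch of labeled news can be folded in without revisiting earlier samples. It assumes a single machine and pre-tokenized documents. The class name `IncrementalNaiveBayes`, its methods, and the toy data are hypothetical and are not taken from the thesis, whose implementation runs on Hadoop.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial naive Bayes that stores only counts (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.doc_counts = Counter()               # documents seen per class
        self.word_counts = defaultdict(Counter)   # word occurrences per class
        self.total_words = Counter()              # total word occurrences per class
        self.vocab = set()                        # all words seen so far

    def partial_fit(self, docs, labels):
        """Fold a new batch into the stored counts; old samples are not revisited."""
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            for tok in tokens:
                self.word_counts[label][tok] += 1
                self.total_words[label] += 1
                self.vocab.add(tok)
        return self

    def predict(self, tokens):
        """Return the class with the highest log posterior for a tokenized document."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.doc_counts.items():
            score = math.log(n / n_docs)          # log prior
            denom = self.total_words[label] + self.alpha * vocab_size
            for tok in tokens:
                # Smoothed log likelihood of each word given the class
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only
model = IncrementalNaiveBayes()
model.partial_fit([["stock", "market", "rises"], ["team", "wins", "match"]],
                  ["finance", "sports"])
# Later, a new batch arrives and only the counts are updated
model.partial_fit([["bank", "market", "profit"]], ["finance"])
print(model.predict(["market", "profit"]))
```

Because the model state is just additive counts, updating with a new batch gives the same counts as training once on all the data. The same additivity is what makes this kind of classifier a natural fit for count aggregation in a MapReduce-style job, although how the thesis actually distributes the work is not stated in the abstract.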
[Degree-granting institution]: Yanbian University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13;TP391.1


Article ID: 2206764


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2206764.html
