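Research and Design of Online Public Opinion Collection Technology Based on Machine Learning
Published: 2018-04-08 09:54
Topic: online public opinion. Approach: machine learning. Source: University of Electronic Science and Technology of China, master's thesis, 2014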
[Abstract]: With the rapid development of Internet technology, online platforms have become increasingly important, and false, violent, and negative online public opinion has a growing impact on social stability and national security. Collecting online public opinion effectively is therefore important for preventing the spread of harmful information, maintaining social order, and safeguarding public safety. This thesis studies and improves text clustering, a key technology of online public opinion collection systems, and designs and implements a prototype collection system. Its contributions are as follows.
1. It improves the Single-Pass algorithm for text clustering. Unsupervised text clustering is the core of machine-learning-based public opinion collection. Single-Pass performs well at extracting topics from online information, but it depends strongly on the order in which texts arrive: for the same data set, a different input order can produce different clustering results. The thesis designs a double-threshold Single-Pass algorithm that introduces an intermediate state to constrain the drift of cluster center vectors, which reduces the dependence on input order. Experiments reported in the thesis show a considerable improvement in clustering performance.
2. It improves DOM-tree-based main-text extraction by combining the distribution ratios of Chinese characters and non-link text, which raises the accuracy of main-text extraction in the collection system.
3. It constructs a system architecture for machine-learning-based online public opinion collection, designs and implements a prototype system, and tests both the core modules and the system as a whole.
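The abstract does not give the details of the double-threshold Single-Pass algorithm, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes two hypothetical thresholds, `t_low` and `t_high`: a document whose best cosine similarity reaches `t_high` joins the cluster and moves its center; one that falls between the two thresholds joins in an intermediate ("pending") state without moving the center; anything below `t_low` starts a new cluster. The function names, the default threshold values, and the bag-of-words representation are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_double_threshold(docs, t_low=0.3, t_high=0.6):
    """Cluster tokenized documents in a single pass (illustrative only).

    Each cluster is a dict with 'center' (summed term counts of its core
    members) and 'core' / 'pending' lists of document indices.
    """
    clusters = []
    for idx, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["center"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= t_high:
            # Confident match: join the cluster and update its center.
            best["core"].append(idx)
            best["center"].update(vec)
        elif best is not None and best_sim >= t_low:
            # Assumed intermediate state: join without moving the center,
            # so borderline documents cannot drag the topic away.
            best["pending"].append(idx)
        else:
            # Nothing is close enough: start a new topic cluster.
            clusters.append(
                {"center": Counter(vec), "core": [idx], "pending": []}
            )
    return clusters


if __name__ == "__main__":
    sample = [
        ["flood", "rescue", "city"],
        ["flood", "rescue", "city", "rain"],
        ["flood", "market"],
        ["stock", "market", "price"],
    ]
    for c in single_pass_double_threshold(sample):
        print(c["core"], c["pending"])
```

Keeping borderline documents out of the center update is the design choice that limits drift in this sketch: an early, weakly related document can still be attached to a topic, but it cannot shift the center that later documents are compared against.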
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.08; TP181