
Research and Design of Network Public Opinion Collection Technology Based on Machine Learning

Published: 2018-04-08 09:54

  Topic: network public opinion. Focus: machine learning. Source: University of Electronic Science and Technology of China, master's thesis, 2014


[Abstract]: With the rapid development of Internet technology, online platforms have become increasingly important, and false, violent, and negative public opinion online has a growing impact on social stability and national security. Collecting online public opinion effectively is therefore important for preventing the spread of harmful information, maintaining social order, and ensuring public safety. This thesis studies and improves text clustering, the key technology of an online public opinion collection system, and designs and implements a prototype collection system. It makes three contributions. (1) It improves the Single-Pass algorithm for text clustering. Unsupervised text clustering is the core of machine-learning-based public opinion collection. Single-Pass performs well at extracting topics from online content, but it depends strongly on the order in which texts arrive: for the same data set, a different input order can produce different clusters. The thesis designs a dual-threshold Single-Pass algorithm that introduces an intermediate state to constrain the drift of cluster center vectors, which reduces the dependence on input order. Experiments reported in the thesis show a considerable improvement in clustering performance. (2) It improves DOM-tree-based main-text extraction by using the distribution ratios of Chinese characters and non-link text to refine the traditional DOM-tree method, which raises the precision of main-text extraction in the collection system. (3) It builds an architecture for a machine-learning-based online public opinion collection system, designs and implements a prototype, and tests both the core modules and the system as a whole.
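The order dependence of Single-Pass mentioned in the abstract can be illustrated with a minimal sketch. This is the baseline algorithm only, not the dual-threshold variant the thesis proposes, whose details are not given here. The function names, the whitespace tokenizer, the summed term-count centroid, and the threshold of 0.6 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold):
    """Baseline Single-Pass clustering (illustrative, hypothetical helper).

    Each document is scanned once: it joins the most similar existing
    cluster if the similarity reaches `threshold`, otherwise it starts
    a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [text, ...]}
    for text in docs:
        vec = Counter(text.split())  # assumption: whitespace tokenization
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(text)
            # The centroid is kept as summed term counts; cosine similarity
            # ignores scale, so this behaves like a mean vector. This update
            # is what lets the cluster center drift as documents arrive.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return [c["members"] for c in clusters]


# Same three documents, two arrival orders.
order_a = single_pass(["flood", "flood rescue", "rescue"], threshold=0.6)
order_b = single_pass(["flood rescue", "rescue", "flood"], threshold=0.6)
print(order_a)
print(order_b)
```

In this toy input the bridging document "flood rescue" is grouped with whichever neighbor arrives first, and the updated centroid then drifts far enough that the remaining document falls below the threshold. The two runs therefore partition the same data differently.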
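[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification numbers]: TP393.08; TP181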




Article ID: 1721102


Article link: http://sikaile.net/guanlilunwen/ydhl/1721102.html

