

Cluster Analysis of Microblog Comment Information

Published: 2018-03-08 13:07

  Topic: microblog comment analysis. Focus: Chinese word segmentation. Source: Anhui University, master's thesis, 2017. Type: degree thesis.


[Abstract]: As a social platform for sharing and exchanging information, the microblog has developed rapidly and been widely adopted since the domestic company Sina launched its microblog platform in 2009. As of September 30, 2016, Sina Weibo had 297 million monthly active users. Microblog information is characterized by simple and fast interaction, dissemination anytime and anywhere, a low threshold for publishing, and fission-like propagation. As a news release platform, a place where news happens, and an information exchange platform, the microblog plays an increasingly important role in everyday online behavior such as learning about, publishing, and exchanging information. At the same time, microblog messages are short, enormous in number, and complex in content, so traditional data mining methods face many challenges when analyzing this type of information. This thesis therefore applies text clustering, tailored to the characteristics of microblog comments, to a large set of user comments on a trending microblog event, and explores a feasible text-clustering-based approach to processing microblog comment information. The aim is to group comments with identical or similar content into clusters, to understand the different views society holds on trending events, and to support effective public opinion analysis and monitoring; for specific events, it can also help leadership better understand public opinion and inform decision-making reform.
The main work is as follows. First, the thesis analyzes the characteristics of microblog text, surveys commonly used text analysis methods, and describes cluster analysis techniques, including the definition of clustering, its forms, and similarity measures. Second, in view of the characteristics of microblog information and how it is processed, it analyzes the steps for clustering microblog comments: text preprocessing, text representation, and cluster analysis. For the preprocessing stage it discusses Chinese word segmentation, stop-word filtering, and text denoising; for the representation stage it discusses several text representation methods and feature-term weighting methods; for the clustering stage it analyzes different clustering approaches and describes several algorithms. This discussion determines the specific methods adopted. R is then used for text denoising, and the jiebaR package for Chinese word segmentation, stop-word filtering, and other preprocessing. After comparing several text representation methods, the thesis adopts the vector space model to represent the comment texts. For clustering it uses the widely used k-means algorithm but, because k-means is sensitive to initial points and outliers and requires k to be set manually, it adds the k-medoids algorithm. k-medoids is similar to k-means but robust to outliers, and with the pamk function in R the value of k does not have to be set manually. During implementation, the thesis analyzes how different values of k and different initial points affect the clustering results, and discusses ways of implementing k-medoids and k-means in R. Word clouds and term networks are used to visualize the comments.
The thesis crawls more than 4,000 comments on a microblog post published by CCTV News on April 26 about the launch of the first domestically built aircraft carrier. After preprocessing and text representation, term clustering and document clustering are carried out on the resulting structured data. The experiments show that the choice of random seed has little effect on the clustering results and, because the data set is not large, there is no noticeable difference in running time between the algorithms. When hierarchical clustering is used to cluster the feature terms, the sum-of-squared-deviations (Ward) method and the maximum-distance (complete-linkage) method give the better results. The k-medoids analysis yields an optimal number of clusters of 2 with an average silhouette value of 0.69, indicating that the two clusters are well separated. Because the thesis uses dictionary-based word segmentation and the vector space model, the semantic links between feature terms are weak, which makes the clustering results less reasonable than desired.
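
The pipeline described in the abstract (vector space representation, k-medoids clustering, and choosing k by average silhouette value) can be sketched in a few lines. The thesis itself works in R with jiebaR and pamk (which, as far as we know, ships in the fpc package and picks k by average silhouette width by default); the Python below is only an illustrative stand-in and not the author's code. Assumptions: TF-IDF weighting with a smoothed IDF, cosine distance, and a simple alternating k-medoids rather than the PAM swap procedure. The toy comments and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized comments into sparse TF-IDF vectors (vector space model)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed IDF (an assumption) keeps every term weight strictly positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine_distance(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return max(0.0, 1.0 - dot / (nu * nv)) if nu and nv else 1.0


def distance_matrix(vectors):
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = cosine_distance(vectors[i], vectors[j])
    return d


def assign(d, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]]) for i in range(len(d))]


def k_medoids(d, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (not full PAM)."""
    n = len(d)
    # Deterministic start: the most central point, then farthest-first picks.
    medoids = [min(range(n), key=lambda i: sum(d[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(d[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(d, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total distance to its cluster.
            best = min(members, key=lambda i: sum(d[i][j] for j in members)) if members else medoids[c]
            updated.append(best)
        if updated == medoids:
            break
        medoids = updated
    return assign(d, medoids), medoids


def average_silhouette(d, labels):
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # a singleton contributes silhouette 0 by convention
        a = sum(d[i][j] for j in own) / (len(own) - 1)
        b = min(sum(d[i][j] for j in other) / len(other)
                for o, other in clusters.items() if o != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(d, k_values):
    """Return (best average silhouette, best k) over the candidate k values."""
    return max((average_silhouette(d, k_medoids(d, k)[0]), k) for k in k_values)


# Made-up, already segmented comments: two obvious topics, three comments each.
comments = [
    ["carrier", "launch", "proud"], ["carrier", "launch", "great"],
    ["carrier", "proud", "great"], ["price", "expensive", "tax"],
    ["price", "tax", "money"], ["expensive", "money", "tax"],
]
d = distance_matrix(tfidf_vectors(comments))
asw, k = choose_k(d, range(2, 5))
labels, medoids = k_medoids(d, k)
print(k, round(asw, 2), labels)
```

Working from a precomputed distance matrix keeps the clustering step independent of the text representation, so a different weighting scheme or distance could be swapped in without touching `k_medoids`. Real comments would first need the segmentation and stop-word filtering that the thesis performs with jiebaR.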

[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1



Article ID: 1584022


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1584022.html


