
Research and Implementation of Hot Topic Discovery on Microblogs

Published: 2018-02-23 01:09

  Keywords: microblog; hot topic discovery; microblog API; Single-Pass algorithm; LDA model. Source: Zhengzhou University, 2014 master's thesis. Document type: degree thesis


[Abstract]: With the rapid development of the Internet and the widespread adoption of the mobile Internet, the ways in which Internet users communicate with one another have become increasingly diverse. As an emerging platform, microblogging is favored by users for its distinctive flexibility and convenience. While microblogs bring great convenience to people's lives, they also have side effects: for example, some people use microblogs to deliberately spread false news, which harms social stability. If such topics can be detected early, appropriate measures can be taken in time. From the user's perspective, a user can only see the microblog posts on their own home page and cannot tell which events most users across the whole microblog network are discussing or following. Timely discovery of hot microblog topics is therefore very meaningful.
This thesis defines the heat of a topic so that hot topics can be expressed quantitatively: for a given topic, the more recently its posts were published and the more comments and reposts they have, the higher the topic's heat and the more likely it is to be a hot topic. Many researchers in China and abroad have studied hot topic discovery; their approaches fall broadly into three families (clustering algorithms, LDA models, and sentiment models) or are improvements built on them. In studying hot topic discovery on microblogs, the first problem to solve is obtaining a microblog corpus. Traditional web crawlers are not suited to collecting microblog data, and the microblog API can only fetch the posts shown on the authenticated user's own home page, so it cannot retrieve posts in bulk. This thesis therefore collects a large number of user accounts by traversing the follow relationships among microblog users and then fetches the latest posts published by those users. Next, the posts are preprocessed: spam filtering, word segmentation, stop-word removal, filtering of useless information, feature word extraction, and feature weight calculation, producing a feature vector for each post. Finally, because microblog posts keep arriving, a suitable Single-Pass incremental clustering algorithm is chosen; it yields multiple clusters, each representing a topic, and each topic contains many posts. To select hot topics from among these topics, the thesis applies the heat defined above: topics whose posts are more recent and have more comments and reposts have higher heat and are more likely to be hot topics.
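The clustering and ranking steps described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract gives no formulas, so the term-frequency weighting, the cosine similarity threshold of 0.3, the exponential recency decay with a 24-hour half-life, and the field names `tokens`, `posted_at_hours`, `comments`, and `reposts` are all assumptions made for the example. The posts are assumed to be already segmented and stop-word filtered.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Sparse term-frequency vector for one post (assumed weighting scheme)."""
    return dict(Counter(tokens))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(posts, threshold=0.3):
    """Single-Pass clustering: each post is seen once, in arrival order.

    A post joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": summed vector, "members": [post indices]}
    for i, post in enumerate(posts):
        vec = tf_vector(post["tokens"])
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Cosine is scale-invariant, so the sum of member vectors
            # serves as the centroid without dividing by the member count.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return clusters

def heat(cluster, posts, now_hours, half_life_hours=24.0):
    """Hypothetical heat score: newer posts and more interactions score higher."""
    total = 0.0
    for i in cluster["members"]:
        p = posts[i]
        age = now_hours - p["posted_at_hours"]
        recency = 0.5 ** (age / half_life_hours)  # assumed decay, not from the thesis
        total += recency * (1 + p["comments"] + p["reposts"])
    return total

# Made-up, pre-segmented posts; times are hours on an arbitrary clock.
posts = [
    {"tokens": ["earthquake", "city", "rescue"], "posted_at_hours": 46, "comments": 120, "reposts": 300},
    {"tokens": ["earthquake", "rescue", "team"], "posted_at_hours": 47, "comments": 80, "reposts": 150},
    {"tokens": ["movie", "premiere", "actor"], "posted_at_hours": 10, "comments": 200, "reposts": 400},
    {"tokens": ["movie", "actor", "award"], "posted_at_hours": 12, "comments": 50, "reposts": 60},
]

clusters = single_pass(posts)
ranked = sorted(clusters, key=lambda c: heat(c, posts, now_hours=48), reverse=True)
for c in ranked:
    print(c["members"], round(heat(c, posts, now_hours=48), 1))
```

Because each post is compared only against the current cluster centroids, new posts can be added without re-clustering earlier ones, which is the property that makes Single-Pass a fit for a continuously growing stream. The result does depend on arrival order and on the chosen threshold.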
The literature shows that the LDA topic model can also be used to discover topics, but it requires many iterations and runs slowly on large amounts of data. LDA does, however, have an advantage in representing topics. This thesis therefore combines the Single-Pass algorithm with the LDA model: it first clusters the microblog texts with Single-Pass, then applies LDA to each cluster, and finally obtains the hot microblog topics. This produces more accurate topics than Single-Pass alone and runs faster than LDA alone.
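The two-stage idea can be illustrated with a small sketch in which LDA is run separately on each cluster instead of on the whole corpus. Everything here is an assumption made for illustration: the abstract does not say how LDA is fitted or configured, so this example uses a plain collapsed Gibbs sampler with hypothetical hyperparameters (`alpha`, `beta`, two topics, 50 iterations), and the `clusters` list stands in for the output of a Single-Pass stage.

```python
import random

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dk = [[0] * num_topics for _ in docs]      # topic counts per document
    n_kw = [dict() for _ in range(num_topics)]   # word counts per topic
    n_k = [0] * num_topics                       # total tokens per topic
    z = []                                       # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] = n_kw[k].get(w, 0) + 1
            n_k[k] += 1
        z.append(assignments)

    # Resample each token's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t].get(w, 0) + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] = n_kw[k].get(w, 0) + 1
                n_k[k] += 1
    return n_kw

def top_words(n_kw, n=3):
    """Most frequent words of each topic, ties broken alphabetically."""
    result = []
    for counts in n_kw:
        live = [(w, c) for w, c in counts.items() if c > 0]
        live.sort(key=lambda wc: (-wc[1], wc[0]))
        result.append([w for w, _ in live[:n]])
    return result

# Stand-in for stage 1: each inner list is one cluster of tokenised posts.
clusters = [
    [["earthquake", "city", "rescue"], ["earthquake", "rescue", "team"]],
    [["movie", "premiere", "actor"], ["movie", "actor", "award"]],
]

# Stage 2: fit a small LDA model inside each cluster.
models = [lda_gibbs(docs) for docs in clusters]
summaries = [top_words(m) for m in models]
for idx, topics in enumerate(summaries):
    print("cluster", idx, topics)
```

Each sampler only iterates over the tokens of one cluster, so the iterative cost of LDA is paid on small, already coherent groups of posts rather than on the full corpus, which matches the speed argument made in the abstract.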

[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092




Article ID: 1525763


Article link: http://sikaile.net/guanlilunwen/ydhl/1525763.html

