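Research on Public Event Detection Algorithms for Personal Weibo
Published: 2018-06-15 22:57
Topic: Weibo + topic words. Source: Inner Mongolia University of Science and Technology, master's thesis, 2014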
[Abstract]: With the rapid development of computer application technology, Internet media has risen and quickly come to influence people's daily lives, becoming another news carrier after traditional media such as television, newspapers, and radio. Because information spreads rapidly online and is itself diverse, open, and real-time, the Internet plays an important role as a platform for spreading breaking social events.

Weibo (microblogging), with Sina Weibo as its typical representative, is a network medium that has emerged and grown rapidly in China in recent years. Users can update their status and share information anytime and anywhere through web pages, mobile clients, and other channels. Sina is currently the most popular Weibo site in China and has the largest user base. According to the latest statistics as of July 2013, Sina Weibo had reached 330 million registered users, producing a huge volume of Weibo data.

Weibo data is irregular, massive, and real-time. How to accurately extract, from a large volume of irregular personal Weibo data, the public events a user followed during a given period is therefore the primary problem that personal Weibo information detection must solve.

This thesis uses personal Weibo data as its experimental test sample. The main research question is how to detect, from a user's personal Weibo posts, which public events that user followed during a given period. Repeated experiments showed that applying traditional event extraction algorithms to personal Weibo data gives unsatisfactory results. Building on a series of algorithm trials and experiments, the thesis therefore takes into account the non-mainstream text features of personal Weibo posts, sets the work against the background of short-text data mining, and focuses on topic word extraction. It studies text acquisition, preprocessing, similarity measurement, feature value calculation, and finally forward and reverse matching against public templates.

The work yields a reasonable and complete workflow for detecting public events in personal Weibo, consisting of three modules: text preprocessing, topic word identification, and public template matching. Preprocessing mainly removes noise from the text so that its representation is more standardized. Topic words are extracted by combining three similarity measures (coupling, temporal, and popularity) with the TF-DF function proposed in the thesis, which both accounts for the characteristics of the experimental data and improves the accuracy of topic word extraction. Public template matching compares the topic words with the template events from the Sina Fengyun list in two steps, forward matching followed by reverse matching, to obtain the final public event detection results.
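The three modules named in the abstract (preprocessing, topic word identification, template matching) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the thesis's method: tokens are split on whitespace (real Weibo text would need a Chinese word segmenter), the score `tf * df` is only a stand-in because the abstract does not define the TF-DF function, the coupling, temporal, and popularity similarities are omitted, and "forward" and "reverse" matching are assumed to mean the share of topic words covered by a template and the share of template keywords covered by the topic words. The function names and the `templates` dictionary are hypothetical.

```python
import re
from collections import Counter


def preprocess(post):
    """Remove noise (URLs, @mentions, punctuation), lowercase, split on whitespace."""
    post = re.sub(r"https?://\S+", " ", post)   # drop links
    post = re.sub(r"@\w+", " ", post)           # drop @mentions
    post = re.sub(r"[^\w\s]", " ", post)        # drop punctuation
    return post.lower().split()


def topic_words(posts, top_k=3):
    """Rank words by a stand-in score: term frequency times post frequency.

    This is NOT the thesis's TF-DF function, which the abstract does not define.
    """
    tokenized = [preprocess(p) for p in posts]
    tf = Counter(w for toks in tokenized for w in toks)       # total occurrences
    df = Counter(w for toks in tokenized for w in set(toks))  # posts containing w
    scores = {w: tf[w] * df[w] for w in tf}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def match_events(words, templates, threshold=0.5):
    """Return template events that pass both assumed matching directions."""
    word_set = set(words)
    detected = []
    for name, keywords in templates.items():
        overlap = word_set & keywords
        # Assumed "forward" match: share of topic words found in the template.
        forward = len(overlap) / len(word_set) if word_set else 0.0
        # Assumed "reverse" match: share of template keywords found in topic words.
        reverse = len(overlap) / len(keywords) if keywords else 0.0
        if forward >= threshold and reverse >= threshold:
            detected.append(name)
    return detected


# Hypothetical posts and template events, for illustration only.
posts = [
    "Earthquake in Sichuan today http://t.cn/abc",
    "@friend praying for Sichuan earthquake victims",
    "lunch was great",
]
templates = {
    "sichuan earthquake": {"sichuan", "earthquake"},
    "world cup": {"world", "cup", "football"},
}
words = topic_words(posts, top_k=2)
events = match_events(words, templates)
```

Requiring both directions to pass keeps a template from matching on a single shared word; the 0.5 threshold is arbitrary and would need tuning on real data.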
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
Article ID: 2023989
Article link: http://sikaile.net/guanlilunwen/ydhl/2023989.html