Research on Temporal Events in Personal Weibo
Published: 2019-03-15 14:44
[Abstract]: As an emerging social media service, Weibo permeates and influences people's lives in many ways and has become an important platform for sharing information and exchanging feelings. Most personal Weibo content records the author's life experiences, professional interests, and discussions of trending topics, so Weibo data has become a carrier of personal history and emotion. Because posting to Weibo is immediate and convenient (a post sometimes takes only seconds), personal Weibo has gradually replaced the diary, forming hourly or even minute-by-minute records. The data accumulated over a long period becomes very large, and the only way to learn about a blogger is to browse their historical posts one by one, which wastes time. Quickly and accurately understanding a blogger's activity has therefore become a pressing problem, and Weibo categorization was proposed to address it. In the categorization process, the precision of the Weibo similarity measure determines the accuracy of the result; the focus of this thesis is how to improve that precision. Because personal Weibo data is large in overall volume while individual posts are short and highly informal, traditional classification methods and information-extraction algorithms face certain limitations here. Considering that a single post is short, contains few effective features, and is rather colloquial, this thesis expands the feature words of a text with same-category words to reduce the chance of feature loss as far as possible, and proposes a combined similarity algorithm based on improved Jaccard similarity and cosine similarity. First, the collected Weibo data is filtered to remove texts carrying no information as well as irrelevant links, images, and the like, and the Chinese Academy of Sciences' lexical analysis system ICTCLAS is used to segment the text, tag parts of speech, and filter out stop words and emoticon words. Second, an improved TF-IDF algorithm is used to extract Weibo feature words, and an LDA (Latent Dirichlet Allocation) topic model is used to construct same-category word templates, improving the precision of Weibo similarity: the feature-selection evaluation function CHI first measures how important each feature word is to each category and ensures the feature words are approximately uniformly distributed across that category's texts, and TF-IDF values are then computed to extract the feature words. Then, on the basis of the extracted feature words and the constructed same-category word templates, Jaccard similarity and cosine similarity are combined to compute an overall similarity between personal Weibo posts; this algorithm overcomes the shortcomings of traditional methods based only on word co-occurrence and can compute the similarity of two posts more deeply and comprehensively, drawing on both same-category word features and individual numerical features. Finally, a K-Means temporal event categorization algorithm groups the personal Weibo data so that posts on the same topic fall into the same set. Experimental results show that the proposed combined similarity algorithm achieves higher precision than traditional similarity algorithms and, to a certain extent, improves the accuracy of temporal event categorization for personal Weibo.
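The abstract names two ingredients of the feature-extraction step: the CHI (chi-square) feature-selection function and TF-IDF weighting. The sketch below shows textbook versions of both in Python; the thesis's exact combination (CHI-filtered terms re-weighted by TF-IDF, with the uniform-distribution adjustment) is not spelled out in the abstract, so this is an illustration of the standard formulas only, not the author's implementation.

```python
import math
from collections import Counter

def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term for one category, from the 2x2
    contingency table: n11 = in-category docs containing the term,
    n10 = out-of-category docs containing it, n01 and n00 = the
    corresponding counts of docs NOT containing the term."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if den == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / den

def tf_idf(docs):
    """Plain TF-IDF weights for each tokenized document.
    Terms appearing in every document get weight 0 (log(1) = 0)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

A term that is evenly spread across categories (e.g. a contingency table of 2, 2, 2, 2) scores 0 under CHI and would be dropped before TF-IDF weighting; a term concentrated in one category scores high and is kept.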
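The core contribution described above is a combined measure built from Jaccard and cosine similarity. A minimal sketch of such a blend is given below; the `weight` parameter and the simple linear combination are assumptions for illustration, since the abstract does not give the thesis's exact formula, and the LDA same-category word templates that the thesis also folds in are not modeled here.

```python
import math
from collections import Counter

def jaccard_similarity(tokens_a, tokens_b):
    """Set-overlap similarity between two token lists."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def combined_similarity(tokens_a, tokens_b, weight=0.5):
    """Weighted blend of the two measures; `weight` is a
    hypothetical tuning parameter, not from the thesis."""
    return (weight * jaccard_similarity(tokens_a, tokens_b)
            + (1 - weight) * cosine_similarity(tokens_a, tokens_b))

# Example with two already-segmented posts
post1 = ["今天", "去", "图书馆", "看书"]
post2 = ["下午", "在", "图书馆", "看书"]
print(round(combined_similarity(post1, post2), 3))  # → 0.417
```

The blend rewards both shared vocabulary (Jaccard's set overlap) and similar term-frequency profiles (cosine), which is the intuition behind combining the two measures; pairwise scores like these would then feed the K-Means temporal event clustering step.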
【學(xué)位授予單位】:內(nèi)蒙古科技大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2014
【分類號】:TP393.092
本文編號:2440722
Link to this article: http://sikaile.net/guanlilunwen/ydhl/2440722.html