Short-Text Topic Mining Based on W-BTM and Its Application to Text Classification
Published: 2018-01-20 12:58
Keywords: W-BTM model; topic mining; short text; text classification  Source: Shanxi University of Finance and Economics, 2017 master's thesis  Thesis type: degree thesis
[Abstract]: With the rapid rise of the Internet, social networking sites, and e-commerce, unstructured information represented by text has emerged in huge volumes. Mining valuable information from it has become increasingly important, while its complex semantics make that extraction increasingly difficult. Short texts in particular, being sparse and incomplete, pose new and substantial challenges to text mining, so research has gradually shifted toward short-text mining. BTM is a topic model designed for short texts and has a clear advantage over other topic models in coping with their sparsity and incompleteness. However, existing text mining models, BTM included, have no built-in mechanism for handling stop words: they simply load a stop-word list during preprocessing and delete the listed words. Corpora differ, and applying one and the same stop-word list everywhere is not well founded; for each corpus, the stop words should instead be identified from that corpus's own characteristics. Starting from these observations about short texts and stop-word handling, this thesis uses the coefficient of variation as a weight model to represent the weight of each word in the text, and feeds these weights into BTM as an additional parameter to form the W-BTM model, thereby reducing the influence of both short-text sparsity and stop words on topic mining. Model parameters are estimated with Gibbs sampling: samples are drawn from the prior distributions of the latent variables and the posterior parameters are estimated from them.
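The abstract does not spell out how the coefficient-of-variation weights are computed or how they enter the biterm counts, so the following is only a minimal sketch of the idea. It assumes the coefficient of variation is taken per word over document-level term frequencies and that each biterm simply inherits the product of its two word weights; both are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np
from itertools import combinations

# Toy short-text corpus (a stand-in for the Dangdang book-blurb data).
docs = [["a", "history", "of", "china"],
        ["a", "survey", "of", "machine", "learning"],
        ["short", "text", "topic", "mining"]]

vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Document-term frequency matrix.
X = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc:
        X[d, index[w]] += 1

# Coefficient-of-variation weight per word: words spread evenly across
# documents (stop-word-like, e.g. "a", "of") vary little relative to their
# mean and get a low weight, while discriminative words get a high weight.
mean = X.mean(axis=0)
std = X.std(axis=0)
cv = np.divide(std, mean, out=np.zeros_like(std), where=mean > 0)
weights = cv / cv.max()

# Weighted biterms: every unordered pair of distinct words in a document,
# carrying the product of the two word weights instead of a uniform count
# of 1; these weighted counts are what would feed the Gibbs sampler in
# place of BTM's plain biterm counts.
def weighted_biterms(doc):
    ids = sorted({index[w] for w in doc})
    return [((i, j), weights[i] * weights[j]) for i, j in combinations(ids, 2)]

for d, doc in enumerate(docs):
    print(d, weighted_biterms(doc))
```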
The model is then applied to book-blurb data from Dangdang: a support vector machine classifies the document-topic matrix produced by W-BTM, and the classification results of the different models are compared to demonstrate the superiority of W-BTM. The premise under which W-BTM collects biterms (word pairs) over the whole corpus is that the weight of each word in the pair, i.e. its coefficient of variation over the document collection, is already known. A biterm thereby takes on a deeper meaning: it no longer merely records that two words co-occur in a document, but also carries a property of the words themselves, namely whether they behave as stop words. This removes the effect that a poor choice of stop-word list would otherwise have on the accuracy of text mining. To verify the validity and soundness of W-BTM, text classification experiments are run against the LDA and BTM models, and the results are evaluated from the two angles of topic mining and text classification; W-BTM is shown to classify better than both LDA and BTM.
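For the classification stage, the document-topic matrix produced by the topic model serves as the feature matrix of an SVM. A minimal sketch with scikit-learn, using randomly generated data as a stand-in for the W-BTM output and the Dangdang category labels (the thesis's actual features, labels, and imbalance treatment are not given in the abstract):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Stand-in for the document-topic matrix theta produced by a topic model
# (rows: documents, columns: topic proportions) and for the book-category
# labels; in the thesis these come from W-BTM and the Dangdang blurbs.
theta = rng.dirichlet(alpha=np.ones(20), size=1000)   # 1000 docs, 20 topics
labels = rng.integers(0, 5, size=1000)                # 5 book categories

X_train, X_test, y_train, y_test = train_test_split(
    theta, labels, test_size=0.3, random_state=0, stratify=labels)

# class_weight="balanced" is one common way to counter class imbalance.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```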
The contributions of the thesis are as follows. (1) For stop-word handling, the traditional approach of picking a stop-word list and deleting the listed words outright is replaced by the weight model, which makes the mining results more principled and accurate. (2) The weight model is combined with BTM into a new topic model, W-BTM, which can be used to classify short texts, alleviates their sparsity, and closes the gap left by stop-word handling in preprocessing. (3) W-BTM is applied to the classification of Dangdang book blurbs, giving the model practical significance. Through the treatment of class imbalance in the data, the use of W-BTM, and SVM classification of the document-topic matrix, the validity of W-BTM is verified; the comparison of its classification results with those of LDA and BTM confirms its superiority.
【Degree-granting institution】: Shanxi University of Finance and Economics
【Degree level】: Master's
【Year of award】: 2017
【CLC number】: TP391.1
Article No.: 1448271
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1448271.html