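Research on Feature Expansion Methods in LDA-Based Short Text Classification
Published: 2018-04-29 00:11
Topics: topic model + feature expansion; Source: China University of Geosciences (Beijing), master's thesis, 2017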
[Abstract]: With the arrival of the information age, people spend more and more time online, and content distribution platforms and social networking sites have grown rapidly in recent years. Analyzing online public opinion and organizing online news both require classifying text according to given criteria, which makes text classification, and short text classification in particular, an important research topic. Methods designed for long text cannot simply be carried over to short text. One approach is to first expand the short text with classification keywords and then apply a classifier. Following this approach, this thesis proposes a feature expansion method that combines LDA topic words with a feature classification weight. The thesis first examines the vector space model, the representation commonly used for long text. The vector space model suits long text, which carries many keywords, but for short text, which has few keywords, the feature vectors become too sparse, so the model cannot be used directly to represent short text. Drawing on existing research in China and abroad, the thesis then studies the theoretical basis of the LDA model, uses LDA to obtain the topic-word distribution of the corpus and to infer the topic of each test sample, and analyzes the correlation between the test text and the topic words under its assigned topic. This analysis shows that expanding short text directly with LDA topic words has shortcomings. To address them, the thesis proposes a feature classification weight that reflects how a feature word's classification information differs across categories. The weight takes into account the distribution of the feature word between classes, its dispersion within a class, and cases where the feature word is incompletely classified within a class. On this basis, a candidate word self-selection mechanism is introduced for feature expansion with LDA topic words. To verify the method, a classification platform is built with ICTCLAS (the Chinese Academy of Sciences word segmentation tool) and LIBSVM, and the proposed feature expansion method is compared with the traditional short text classification method based on LDA feature expansion. The experiments show that expanding short text features with the proposed method improves classification performance to some extent.
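The expansion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the topic-word table stands in for the output of a trained LDA model, the per-word weights are made-up placeholders for the thesis's feature classification weight (whose formula the abstract does not give), and the names `TOPIC_WORD`, `CLASS_WEIGHT`, `dominant_topic`, and `expand` are hypothetical. Topic assignment is also simplified to summing word probabilities rather than running real LDA inference.

```python
# Hypothetical toy topic-word distributions, standing in for a trained LDA model.
TOPIC_WORD = {
    "sports": {"match": 0.30, "team": 0.25, "score": 0.20, "player": 0.15, "news": 0.10},
    "finance": {"stock": 0.30, "market": 0.25, "bank": 0.20, "fund": 0.15, "news": 0.10},
}

# Made-up weights in [0, 1]. The thesis derives its weight from between-class
# distribution, within-class dispersion, and incomplete classification; that
# formula is not in the abstract, so these numbers are placeholders only.
CLASS_WEIGHT = {
    "match": 0.9, "team": 0.8, "score": 0.7, "player": 0.85,
    "stock": 0.9, "market": 0.8, "bank": 0.75, "fund": 0.7,
    "news": 0.1,  # appears in every topic, so it carries little class information
}


def dominant_topic(tokens):
    """Pick the topic whose word distribution gives the tokens the most mass.

    This is a simplification of LDA inference, used only for illustration.
    """
    scores = {
        topic: sum(dist.get(tok, 0.0) for tok in tokens)
        for topic, dist in TOPIC_WORD.items()
    }
    return max(scores, key=scores.get)


def expand(tokens, top_k=3, min_weight=0.5):
    """Append up to top_k topic words whose placeholder weight passes the threshold."""
    topic = dominant_topic(tokens)
    # Rank the topic's words by probability, highest first.
    ranked = sorted(TOPIC_WORD[topic].items(), key=lambda kv: kv[1], reverse=True)
    # Candidate selection: skip words already present and words with a low weight.
    candidates = [
        word for word, _ in ranked
        if word not in tokens and CLASS_WEIGHT.get(word, 0.0) >= min_weight
    ]
    return tokens + candidates[:top_k]


if __name__ == "__main__":
    print(expand(["team", "news"]))
```

Without the weight filter, a word such as "news" would be added to every short text regardless of category, which is the kind of weakness of direct topic-word expansion that the abstract describes.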
[Degree-granting institution]: China University of Geosciences (Beijing)
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
Article ID: 1817499
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1817499.html