Research on Text Representation Models and Feature Selection Algorithms
Topic: text classification + feature selection. Source: University of Science and Technology of China, Master's thesis, 2017.
【Abstract】: Text classification is an effective means of processing unstructured information and has been widely studied and applied in machine learning, information retrieval, and related fields. However, because text features are high-dimensional and sparse, both the accuracy and the speed of text classification depend heavily on the choice of feature selection method and text representation model. This thesis studies both aspects; the main contributions are as follows:

(1) Traditional statistics-based feature selection methods do not consider the semantics of features. This thesis proposes feature selection methods based on LDA word vectors and Word2vec word vectors, which learn the semantic concepts of features from topics and from word context, respectively. After feature selection, the corpus is classified using the vector space model. Experiments on the Fudan corpus show that word-vector-based feature selection improves classification over traditional feature selection. Moreover, word-vector-based feature selection is unsupervised and requires no labeled dataset.

(2) The LDA model (Latent Dirichlet Allocation) does not select its input features, so many words that contribute nothing to topic expression degrade topic quality. This thesis therefore proposes genetic-algorithm-based text feature selection: a genetic algorithm first reduces the dimensionality of the original feature space so that LDA assigns topics over a more meaningful feature space. Classification experiments on the Fudan corpus show improved results. The proposed genetic algorithm is also adaptive, requiring no preset feature selection ratio. Among the topics LDA generates there are junk topics, i.e., topics that are merely collections of unrelated feature words; meaningful topics are currently found mainly by manual inspection. The only existing method for automatic topic ranking is TSR (Topic Significance Ranking), which involves many steps and considers only the distance between a topic and junk topics, ignoring the relationships among topics. For ranking topic importance, this thesis proposes a maximum-junk-topic-distance / minimum-similarity method. Experiments show that the proposed ranking method is simple and efficient, and can identify meaningful topics.

(3) The LF-LDA model (latent feature LDA) incorporates trained word vectors and classifies text better than LDA. Building on LF-LDA, this thesis proposes a text representation model combining LF-LDA with Word2vec, which represents a document by the distances between the topic vectors generated by LF-LDA and the document vector produced by Word2vec. In addition, a topic-vector-based representation is proposed that represents a document as a weighted combination of the topic vectors generated by LF-LDA. Experiments on the StackOverflow short-text dataset show that the LF-LDA + Word2vec representation classifies better than LF-LDA alone and than the LDA + Word2vec combined representation, while the topic-vector-based representation performs comparably to LF-LDA.
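The thesis does not publish the exact scoring rule of its word-vector-based feature selection, but the unsupervised idea (keep semantically representative, mutually non-redundant terms) can be sketched as follows. This is an illustrative redundancy-reduction criterion on cosine similarity, not the thesis's algorithm; the toy vectors stand in for Word2vec or LDA-derived embeddings.

```python
import numpy as np

def select_features(vectors, k):
    """Greedily pick k terms whose embeddings are mutually dissimilar.

    vectors: dict term -> embedding (e.g. Word2vec or LDA word vectors).
    Hypothetical criterion: start from the most central term, then repeatedly
    add the term least similar to anything already selected, so the kept
    features cover distinct semantic concepts without labeled data.
    """
    terms = list(vectors)
    mat = np.stack([vectors[t] / np.linalg.norm(vectors[t]) for t in terms])
    # seed with the term closest to the centroid (most "central" concept)
    centroid = mat.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    selected = [int(np.argmax(mat @ centroid))]
    while len(selected) < k:
        sims = mat @ mat[selected].T        # cosine sim to chosen terms
        redundancy = sims.max(axis=1)       # worst-case overlap with selection
        redundancy[selected] = np.inf       # never re-pick a chosen term
        selected.append(int(np.argmin(redundancy)))
    return [terms[i] for i in selected]
```

With near-synonymous terms in the vocabulary, the sketch keeps only one of them, which is the redundancy-removal behavior a semantic feature selector is after.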
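The adaptive property of the genetic-algorithm feature selection (no preset selection ratio) comes from evolving binary feature masks whose popcount is free to vary. A minimal sketch, with hypothetical operators (one-point crossover, single bit-flip mutation, elitist survival) since the thesis's exact GA configuration is not given here:

```python
import random

def ga_feature_select(fitness, n_features, pop=20, gens=40, seed=0):
    """Minimal genetic algorithm over binary feature masks.

    fitness(mask) -> float scores a candidate feature subset (in the thesis
    this would reflect downstream LDA/classification quality; any callable
    works). The number of selected features is not fixed in advance, which
    is the 'adaptive' property highlighted above.
    """
    rng = random.Random(seed)
    popu = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop)]
    for _ in range(gens):
        popu.sort(key=fitness, reverse=True)
        survivors = popu[: pop // 2]        # elitism: best masks carry over
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]       # one-point crossover
            i = rng.randrange(n_features)
            child[i] ^= 1                   # bit-flip mutation
            children.append(child)
        popu = survivors + children
    return max(popu, key=fitness)
```

In practice the fitness would wrap an LDA run or a classifier evaluated on the masked feature space; the sketch treats it as a black box.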
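The maximum-junk-topic-distance / minimum-similarity ranking can be sketched directly on the topic-word matrix. The junk reference here is the uniform distribution over the vocabulary (a "topic" that prefers no word), and the concrete distance/similarity pair (KL to uniform, cosine between topics) is an assumption for illustration:

```python
import numpy as np

def rank_topics(phi):
    """Rank topics by (distance to junk topic) - (max similarity to peers).

    phi: (K, V) matrix of topic-word distributions, rows summing to 1.
    A topic scores high when it is far from the uniform junk reference
    (it concentrates on few words) AND dissimilar to every other topic,
    so redundant or diffuse topics sink to the bottom.
    """
    K, V = phi.shape
    uniform = np.full(V, 1.0 / V)
    # KL(topic || uniform) grows as the topic concentrates on few words
    junk_dist = np.sum(phi * np.log((phi + 1e-12) / uniform), axis=1)
    unit = phi / np.linalg.norm(phi, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)         # ignore self-similarity
    score = junk_dist - sims.max(axis=1)
    return np.argsort(-score)               # best topic first
```

A near-uniform topic gets junk distance close to zero and, being "close to everything", a high peer similarity, so it ranks last, which matches the behavior described above.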
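The two proposed representations reduce to simple linear algebra once the LF-LDA topic vectors and a Word2vec document vector are available: a theta-weighted sum of topic vectors, and a vector of distances from the document vector to each topic vector (cosine distance is assumed here; the thesis's exact distance is not specified in this abstract).

```python
import numpy as np

def doc_from_topics(theta, topic_vecs):
    """Topic-vector representation: theta-weighted sum of topic vectors.

    theta: (K,) topic proportions of the document; topic_vecs: (K, D).
    """
    return theta @ topic_vecs

def distance_features(doc_w2v, topic_vecs):
    """LF-LDA + Word2vec representation: one cosine distance per topic.

    doc_w2v: (D,) Word2vec document vector; topic_vecs: (K, D) LF-LDA
    topic vectors. Returns a (K,) feature vector for the classifier.
    """
    d = doc_w2v / np.linalg.norm(doc_w2v)
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    return 1.0 - t @ d
```

Either output can be fed to an ordinary classifier (e.g. SVM) for the short-text experiments described above.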
【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Master's
【Year awarded】: 2017
【CLC number】: TP391.1