Research on Text Classification Based on Word2vec Word Vectors

Published: 2018-06-22 08:05

Topic: Word2vec model + text representation; Reference: Southwest University, 2017 master's thesis

[Abstract]: Automatic text classification plays an important role in text mining, natural language processing, and machine learning, and it greatly facilitates information retrieval and text management. With the rapid development of Internet technology in recent years, text data has been growing quickly every day: microblog posts, news content on major news portals, e-mail exchanged between users, and forum and blog posts. Automatic text classification is an effective tool for processing and organizing this data and has already been applied in many areas, such as microblog sentiment classification, spam filtering, and automatic distribution of news content. As text data on the Internet keeps growing, automatic text classification will play an increasingly important role in these areas.

Automatic text classification involves several techniques, including text preprocessing, text representation, feature selection, feature extraction, and the choice of classification algorithm. Text representation and the classification algorithm are the key ones, because they directly affect the classification result. Most current research accordingly focuses on feature selection and extraction, text representation, and the optimization of classification algorithms. Among the many text representation models, the vector space model (VSM) weighted by term frequency-inverse document frequency (TF-IDF), referred to here as the VSM_TFIDF model, is a mainstream choice that performs well in both academia and industry. However, it does not represent the semantic information of a text well, because it cannot take the contextual semantics and syntactic information of feature words into account. In addition, common text distance measures such as Euclidean distance and cosine distance do not adequately measure the similarity between texts represented by such models.

To address these problems, this thesis uses Word2vec word vectors to introduce semantic information into the text representation model or the text distance measure, and thereby improves text classification. It studies in depth how Word2vec word vectors are generated, including the two training models (CBOW and Skip-gram) and the two optimization schemes that improve training efficiency (Hierarchical Softmax and Negative Sampling). On this basis, Word2vec word vectors are applied to the study of text representation models and text distance measures. The main work covers the following two aspects:

(1) A multi-granularity, multi-model combined text representation model, CombineTextVector, is proposed, based on Word2vec word vectors and the VSM_TFIDF model. Because Word2vec word vectors represent the semantic information of feature words well, the thesis combines them with the VSM_TFIDF model so that the two complement each other. Text category information is first embedded into the TF-IDF weighting formula to improve the class-discriminating ability of the weighting factor (the result is named the wTFIDF weighting formula). This weighting is then combined with Word2vec word vectors to build a multi-granularity text representation model, Word2vec_wTFIDF. Finally, that model is combined with the traditional VSM_TFIDF model to build the CombineTextVector text representation model. To verify the new model, experiments are designed on the Fudan Chinese text classification corpus and the model is compared with several mainstream text representation models. The results show that the new model consistently achieves relatively high classification F1 values.

(2) A distance measure for topic models, TopMD, is proposed, based on Word2vec word vectors and the EMD (Earth Mover's Distance). The thesis first analyzes the text distance measures commonly used with the traditional VSM_TFIDF model and with topic models. To address the problem that semantic similarity between texts is not measured well, it combines the EMD measure with Word2vec word vectors to obtain the TopMD distance for topic models. Compared with common measures, TopMD takes the finer-grained similarity between feature words into account. To verify the method, experiments are carried out on both Chinese and English corpora and compared with several distance measures. The results show that, relative to traditional measures, the method improves text similarity measurement for topic models.
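The idea behind contribution (1), weighting word vectors by a TF-IDF-style factor and combining the result with a TF-IDF vector space representation, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the abstract does not give the wTFIDF formula, so plain TF-IDF is used instead; the word vectors are hand-written 2-dimensional toys rather than trained Word2vec vectors; and simple concatenation stands in for whatever combination rule the thesis actually uses. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (vocab, weights): weights[i] maps each term of document i
    to its plain TF-IDF weight (NOT the thesis's wTFIDF, which is unspecified).
    Documents are non-empty lists of tokens."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vocab, weights


def weighted_embedding(doc_weights, word_vectors, dim):
    """TF-IDF-weighted average of the word vectors of one document.
    Terms without a word vector are skipped."""
    total = sum(w for t, w in doc_weights.items() if t in word_vectors)
    if total == 0:
        return [0.0] * dim
    vec = [0.0] * dim
    for term, w in doc_weights.items():
        if term in word_vectors:
            for i, x in enumerate(word_vectors[term]):
                vec[i] += w * x
    return [x / total for x in vec]


def combined_vector(doc_weights, vocab, word_vectors, dim):
    """Concatenate the sparse TF-IDF vector with the dense weighted embedding
    (concatenation is an assumed combination rule)."""
    sparse_part = [doc_weights.get(term, 0.0) for term in vocab]
    dense_part = weighted_embedding(doc_weights, word_vectors, dim)
    return sparse_part + dense_part


# Toy corpus and toy 2-d "word vectors" (illustrative values only).
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "falls"],
]
toy_vectors = {
    "stock": [0.9, 0.1], "market": [0.8, 0.2], "rises": [0.7, 0.3],
    "falls": [0.7, 0.2], "team": [0.1, 0.9], "wins": [0.2, 0.8],
    "match": [0.1, 0.8],
}

vocab, weights = tfidf_weights(docs)
doc_vectors = [combined_vector(w, vocab, toy_vectors, 2) for w in weights]
```

Each entry of `doc_vectors` has one component per vocabulary term followed by the two embedding dimensions, so it could be passed to any standard classifier. In practice the two parts would probably need to be scaled or normalized before being joined, since their value ranges differ.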
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2052171.html
