

Research on Text Classification Based on word2vec Word Vectors

Published: 2018-06-22 08:05

Topics: Word2vec model + text representation; Source: Southwest University, master's thesis, 2017


[Abstract]: Automatic text classification plays an important role in text mining, natural language processing, and machine learning, and it greatly facilitates information retrieval and text management. With the rapid development of Internet technology in recent years, text data has been growing quickly every day: microblog posts, news published by major news portals, e-mail, and forum and blog posts. Automatic text classification is an effective tool for processing and organizing such data, and it is already applied in areas such as microblog sentiment classification, spam filtering, and automatic distribution of news content. As text data on the Internet keeps growing, automatic text classification will play an increasingly important role in these areas.

Automatic text classification involves several techniques, including text preprocessing, text representation, feature selection, feature extraction, and the choice of classification algorithm. Among these, text representation and the classification algorithm are the key ones, because they directly affect the classification result. Most current research on text classification likewise concentrates on feature selection and extraction, text representation, and the optimization of classification algorithms. Among the many text representation models, the vector space model (VSM) weighted by term frequency-inverse document frequency (TF-IDF), referred to here as the VSM_TFIDF model, is a mainstream choice that performs well in both academia and industry. However, the model does not represent the semantics of a text well: it cannot take the contextual semantics and syntactic information of feature words into account. In addition, common text distance measures such as Euclidean distance and cosine distance do not measure well the similarity between texts represented by this kind of model.

To address these problems, this thesis uses Word2vec word vectors to introduce semantic information into the text representation model or the text distance measure, and thereby improves text classification. It studies in depth how Word2vec word vectors are generated, including the two training models (CBOW and Skip-gram) and the two optimization schemes that speed up word vector training (Hierarchical Softmax and Negative Sampling). On this basis, Word2vec word vectors are applied to the study of text representation models and text distance measures. The main work comprises the following two parts:

(1) A text representation model, CombineTextVector, that combines multiple granularities and multiple models on the basis of Word2vec word vectors and the VSM_TFIDF model. Because Word2vec word vectors represent the semantics of feature words well, the thesis combines them with the VSM_TFIDF model so that the two complement each other and improve text representation. It first embeds the class information of texts into the TF-IDF weighting formula to improve the class-discriminating power of the weights (the result is named the wTFIDF weighting formula), then combines this with Word2vec word vectors to build a multi-granularity text representation model, Word2vec_wTFIDF, and finally combines that model with the traditional VSM_TFIDF model to build the CombineTextVector model. To verify the performance of the new model, experiments are run on the Fudan Chinese text classification corpus and compared against several mainstream text representation models. The results show that the new model achieves relatively high classification F1 scores in all cases.

(2) A distance measure, TopMD, that is based on Word2vec word vectors and the EMD distance and is designed for topic models. The thesis first analyzes the text distance measures commonly used with the traditional VSM_TFIDF model and with topic models. To address the problem that semantic similarity between texts is not measured well, it combines the EMD measure with Word2vec word vectors and proposes the TopMD distance for topic models. Compared with common measures, TopMD takes the finer-grained similarity between feature words into account. To verify the method, experiments are run on both Chinese and English corpora and compared against several distance measures. The results show that, compared with traditional measures, the method improves text similarity measurement for topic models.
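The combination described under contribution (1) can be illustrated with a small sketch. It is not the thesis's implementation: the abstract does not give the wTFIDF formula, so plain TF-IDF is used instead, and joining the two representations by concatenation is an assumption. The toy word vectors, the function names, and the sample documents are all hypothetical; in practice the vectors would come from a trained Word2vec model.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (made-up values standing in for the
# output of a trained Word2vec model).
WORD_VECTORS = {
    "stock":  [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "goal":   [0.1, 0.9, 0.2],
    "match":  [0.0, 0.8, 0.3],
}
DIM = 3


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Plain TF-IDF (an assumption): tf = count / document length,
    idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_word_vector(weights, word_vectors, dim):
    """TF-IDF-weighted average of the word vectors of one document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, weight in weights.items():
        vec = word_vectors.get(term)
        if vec is None:  # skip terms without a word vector
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing contributed
    return [x / weight_sum for x in total]


def combined_vector(weights, vocabulary, word_vectors, dim):
    """Concatenate the sparse TF-IDF vector with the dense weighted average."""
    sparse_part = [weights.get(term, 0.0) for term in vocabulary]
    dense_part = weighted_word_vector(weights, word_vectors, dim)
    return sparse_part + dense_part


docs = [["stock", "market", "stock"], ["goal", "match"]]
vocabulary = sorted({term for doc in docs for term in doc})
weights = tfidf_weights(docs)
vectors = [combined_vector(w, vocabulary, WORD_VECTORS, DIM) for w in weights]

for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Each resulting vector has one TF-IDF entry per vocabulary term followed by `DIM` dense entries, so a classifier sees both exact term evidence and the smoother semantic signal from the word vectors. A class-aware weighting such as the thesis's wTFIDF would replace `tfidf_weights` without changing the rest of the sketch.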
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1



