Research on a Short Text Feature Expansion Method Based on Word Embedding
Published: 2018-05-20 13:11
Topic: Word Embedding. Source: master's thesis, Jilin University, 2017
【Abstract】: With the development of the Internet and the spread of mobile devices, interpersonal communication has become more timely and convenient. Social media such as SMS, QQ, and Weibo are now an indispensable part of daily life, and the information exchanged on them has become shorter and freer in form. The number of short texts on the web is growing rapidly, which poses new challenges to traditional automatic information processing and text mining techniques designed for long documents. How to cope with the feature sparsity and low feature coverage inherent to short texts has become a focus of much research, and the most direct and effective remedy is to expand the features of the short text. Meanwhile, the steady progress of deep learning has led to its wide application across fields, and natural language processing combined with deep learning has become an inevitable research trend; Word Embedding is one of the important outcomes of this development. A Word Embedding is a vector representation of a word. Unlike traditional mutually independent word representations, it distributes words over a relatively low-dimensional vector space according to the strength of their semantic association, encoding both the explicit and the implicit regularities of the language. A word vector is therefore no longer a mere symbol for identifying a word; it also carries rich semantic information. This thesis takes Word Embedding as the basis for short text feature expansion and proposes a new feature expansion method that enriches the semantic information of short texts while enlarging their feature coverage. The specific research contents are as follows:

1. Training Word Embedding on a large-scale corpus. Word Embedding is trained with neural-network language models; following the development of Word Embedding and the needs of different tasks, four common models are introduced: the neural network language model, the recurrent neural network language model, CBOW, and Skip-gram. Drawing on other researchers' studies of these models and on the task requirements of this thesis, Skip-gram is chosen as the training model, and the content-rich, large-scale English Wikipedia is chosen as training data, yielding Word Embedding representations for more than two million words (a training sketch is given below).

2. Simple reasoning within the scope of a short text through vector arithmetic, based on the properties of Word Embedding. Some of the language rules encoded by Word Embedding can be expressed through addition and subtraction between embeddings; this property is applied to the ordered word sequence of a short text to obtain vector expressions related to the short text's semantics. The resulting inference vectors lie in the same vector space as the Word Embeddings (see the second sketch below).

3. Representing the extended feature space with Word Embedding clusters. Unlike traditional fine-grained semantic representation units (words, phrases, concepts, and so on), this thesis exploits the spatial distribution of Word Embedding and obtains, through clustering, "semantic units" partitioned automatically by semantic similarity. These "semantic units" serve as the feature items of the extended feature space, onto which any vector of the same dimensionality (including the Word Embedding vectors of a short text and the inference vectors described above) can be mapped (see the third sketch below).

Finally, short text classification and clustering experiments are carried out with the proposed Word Embedding based feature expansion method. On the Google search snippets and China Daily news summary datasets, classification accuracy improves on an LDA-based method by 3.7% and 1.0% respectively, and the clustering F-measure improves on traditional clustering methods by 30.64% and 17.54% respectively. The experimental results show that the proposed method expresses the information in short texts better and alleviates their feature sparsity and low feature coverage.
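For step 1, the abstract states only the model (Skip-gram) and the corpus (English Wikipedia). The following is a minimal sketch of such a training pipeline using the gensim library; the dump filename and all hyperparameters (dimensionality, window, frequency cutoff) are illustrative assumptions, not the thesis's actual settings.

```python
# Sketch of Skip-gram training on an English Wikipedia dump with gensim 4.x.
# The dump path and every hyperparameter below are assumptions for
# illustration; the thesis only states that Skip-gram and Wikipedia are used.
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec

class WikiSentences:
    """Restartable stream of tokenized Wikipedia articles (Word2Vec makes
    several passes over the corpus, so a plain generator is not enough)."""
    def __init__(self, dump_path):
        # dictionary={} skips building a gensim Dictionary we do not need.
        self.corpus = WikiCorpus(dump_path, dictionary={})

    def __iter__(self):
        yield from self.corpus.get_texts()

sentences = WikiSentences("enwiki-latest-pages-articles.xml.bz2")  # hypothetical path
model = Word2Vec(
    sentences,
    sg=1,             # 1 selects Skip-gram (0 would select CBOW)
    vector_size=200,  # embedding dimensionality (illustrative)
    window=5,         # context window size
    min_count=5,      # ignore words rarer than this
    workers=4,
)
model.save("wiki_skipgram.model")
```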
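For step 2, the abstract says only that addition and subtraction between embeddings are applied to the short text's ordered word sequence. The sketch below shows the classic analogy arithmetic that motivates this, plus one hypothetical composition rule (offsets between adjacent words); the thesis's exact rule is not given in the abstract.

```python
# Vector arithmetic over Skip-gram embeddings. The adjacent-word offset
# rule in inference_vectors() is a hypothetical illustration of "addition
# and subtraction over the ordered word sequence", not the thesis's rule.
from gensim.models import Word2Vec

model = Word2Vec.load("wiki_skipgram.model")
wv = model.wv

# The regularity that motivates embedding arithmetic:
# vec(king) - vec(man) + vec(woman) is closest to vec(queen).
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

def inference_vectors(tokens, wv):
    """Offsets between adjacent in-vocabulary words of a short text.
    Each offset lives in the same space as the word embeddings, matching
    the abstract's claim about the inference vectors."""
    vecs = [wv[t] for t in tokens if t in wv]
    return [b - a for a, b in zip(vecs, vecs[1:])]

extra = inference_vectors("apple unveils new iphone".split(), wv)
```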
【Degree-granting institution】: Jilin University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1
Article ID: 1914747
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1914747.html