
Research on Semantics-Based Text Vector Representation Methods

Published: 2018-01-20 14:49

  Keywords: text representation, semantics, text classification, opinion extraction, word vectors, neural networks. Source: University of Science and Technology of China, 2017 master's thesis. Type: degree thesis.


【Abstract】: The development and spread of Internet technology lets people obtain information quickly, and in turn people depend more and more on the Internet for information. Text is the main form in which people obtain information online, and the amount of text on the Internet is growing explosively. To help people find the information they need conveniently and accurately, Internet service providers must classify, cluster, and rank text. These tasks usually require representing text as vectors so that different machine learning models can be applied. From the user's point of view, text should be classified, clustered, and ranked according to its semantics. Semantics is an abstract, high-level feature, whereas the widely used bag-of-words representation treats a text as a set of mutually independent symbols and ignores both their meaning and the relationships between them, so it generalizes poorly. Incorporating higher-level semantic information into text vector representations has therefore become a research goal for many scholars. A semantics-based representation expresses a text as a low-dimensional dense vector and generalizes better: even when two texts use different words to express the same meaning, their semantic vectors are similar, a similarity that the bag-of-words model cannot capture. Topic models, including LDA and pLSI, infer the latent topics of a text by modeling its generation process and represent the text as a distribution over topics. Deep neural networks can learn features of data at different levels and are therefore also used to obtain semantic representations of text. This thesis takes semantics-based text vector representation as its subject and carries out the following work:

1. In the unsupervised setting, the bag-of-words model ignores similarity between words, which makes the representation generalize poorly, and it suffers from the curse of dimensionality. To address both problems, this thesis proposes a word-cluster-based representation (BOWL). A word cluster is a set of semantically similar words, and each cluster expresses a "concept", a higher-level and more abstract feature than a single word, so the representation takes the semantics of words into account. The value of each dimension of the BOWL representation is computed with a k-max pooling operation. Experiments demonstrate the effectiveness and efficiency of the BOWL representation.

2. In the supervised setting, complex neural network architectures can capture more accurate semantic information, but training them is very time-consuming and often depends on GPUs. This thesis averages the word vectors of a text at the input layer of a neural network, applies the nonlinear transformation of the hidden layer to obtain a higher-level semantic vector representation of the text, and finally classifies the text in that vector space. Experiments show that this vector-averaging neural network greatly improves classification accuracy over the low-level bag-of-words representation. The thesis also uses experiments to illustrate how the network works and analyzes the optimization process.

3. For the specific task of extracting opinion tags from product review text, traditional word-matching methods generalize poorly. This thesis proposes matching review text to opinion tags by computing the semantic similarity between texts, and designs different similarity computations for long and short sentences. This is equivalent to using a kernel method to implicitly project the texts into a semantic space and compute their distance there. Experiments show that this method greatly improves the recall of extraction and that the model generalizes better.
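The word-cluster idea in item 1 can be illustrated with a small sketch. The abstract does not give the exact BOWL formula, so everything below is an assumption made for illustration: the toy two-dimensional word vectors, the hand-written clusters, the use of cosine similarity to a cluster centroid, and averaging the k largest similarities as the k-max pooling step. The helper names (`bowl_vector`, `k_max_mean`, and so on) are hypothetical and not taken from the thesis.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]

# Toy word vectors (hypothetical; real ones would come from a trained model).
WORD_VECTORS: Dict[str, Vector] = {
    "cheap": [1.0, 0.0], "inexpensive": [0.9, 0.1], "affordable": [0.8, 0.2],
    "fast": [0.0, 1.0], "quick": [0.1, 0.9], "rapid": [0.2, 0.8],
}

# Hand-written word clusters (assumed given; e.g. obtained by clustering
# the word vectors). Each cluster stands for one "concept".
CLUSTERS: List[List[str]] = [
    ["cheap", "inexpensive", "affordable"],  # concept: price
    ["fast", "quick", "rapid"],              # concept: speed
]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_max_mean(values: Sequence[float], k: int) -> float:
    """Assumed k-max pooling: mean of the k largest values (0.0 if empty)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def bowl_vector(tokens: Sequence[str], k: int = 1) -> Vector:
    """One feature per cluster: pooled similarity of the text's words to it."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    features = []
    for members in CLUSTERS:
        center = centroid([WORD_VECTORS[w] for w in members])
        similarities = [cosine(v, center) for v in known]
        features.append(k_max_mean(similarities, k))
    return features


doc_a = ["cheap", "affordable"]  # about price
doc_b = ["inexpensive"]          # about price, shares no word with doc_a
doc_c = ["fast", "rapid"]        # about speed

vec_a, vec_b, vec_c = bowl_vector(doc_a), bowl_vector(doc_b), bowl_vector(doc_c)
print("sim(a, b):", cosine(vec_a, vec_b))
print("sim(a, c):", cosine(vec_a, vec_c))
```

In this toy setup `doc_a` and `doc_b` share no words, so a bag-of-words comparison would see no overlap, yet their cluster-level vectors are closer to each other than to `doc_c`. The output dimension equals the number of clusters rather than the vocabulary size, which is the sense in which such a representation is low-dimensional.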

【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP391.1


Article ID: 1448591



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1448591.html

