Research on Word and Phrase Paraphrase Extraction Based on Context Analysis

Published: 2018-05-23 21:12

Topic: paraphrase extraction + context information; source: 2017 master's thesis, Harbin Institute of Technology

[Abstract]: In everyday life, people often use different wording to express the same information; this is the phenomenon of paraphrase. Its existence makes many natural language processing tasks harder. Word and phrase paraphrase extraction collects words and phrases that express the same meaning from a corpus. The extracted paraphrase resources have important applications in natural language processing tasks such as question answering, information retrieval, machine translation, and text generation, and can improve the performance of the corresponding systems. This thesis covers three research topics: lexical-level paraphrase extraction based on context analysis, phrase-level paraphrase extraction based on the pivot method, and phrase-level paraphrase extraction based on context analysis.

First, the thesis proposes a lexical paraphrase extraction method based on context analysis. Existing work mainly extracts lexical paraphrases from bilingual parallel corpora with the pivot method. This thesis keeps the pivot idea but uses online translation resources for Chinese words to extract candidate lexical paraphrases, which avoids the erroneous paraphrases caused by alignment errors in bilingual parallel corpora. Word vectors are learned from word contexts. The paraphrase score learned by a feedforward neural network is combined with the similarity score between word vectors to give the final score of a lexical paraphrase, and this final score is used to rank and filter the lexical paraphrase resources. Filtering with context and related information reduces the erroneous paraphrases that arise when a foreign-language translation is polysemous. Manual evaluation shows that the lexical paraphrase resources extracted by this method are of higher quality than those extracted by the traditional pivot method.

Second, the thesis builds on the commonly used pivot-based method for phrase paraphrase extraction. That method extracts erroneous phrase paraphrases because of bilingual alignment errors and the polysemy of foreign-language translations, so the extracted phrase paraphrase resources are filtered using translation probability and context information, respectively. Experimental results show that filtering candidate phrase paraphrases with context information substantially improves the quality of the extracted resources.

Finally, the thesis proposes a phrase paraphrase extraction method based on context analysis. A two-layer BiLSTM-CRF model segments a Chinese monolingual corpus into phrases, and a deep learning model then learns vector representations of the phrases. Phrases whose vectors have high cosine similarity are extracted as candidate phrase paraphrases, and these candidates are filtered using the English translations of their words. A method for learning phrase context vectors is proposed, and the similarity of phrase context vectors is used to rank the candidates. Experimental results show that neural network models can learn semantic vector representations of phrases, and that the phrase paraphrase resources obtained after filtering and ranking are of much higher quality than those extracted by the pivot method.
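The pivot idea and the context-based filtering described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The translation dictionary `ZH_TO_EN`, the toy vectors in `VECTORS`, the function names, and the `threshold` value are all assumptions made for illustration, and the feedforward scoring component mentioned in the abstract is omitted, so only the vector-similarity part of the final score is shown.

```python
import math
from collections import defaultdict

# Hypothetical toy data: Chinese words with English translations (the pivots).
ZH_TO_EN = {
    "银行": ["bank"],                # bank (financial institution)
    "河岸": ["bank", "riverside"],   # river bank
    "岸边": ["riverside", "shore"],  # shore, waterside
}

# Hypothetical context vectors; in practice these would be learned from a corpus.
VECTORS = {
    "银行": [1.0, 0.0],
    "河岸": [0.0, 1.0],
    "岸边": [0.1, 0.9],
}


def pivot_candidates(zh_to_en):
    """Pair up source words that share at least one pivot translation."""
    en_to_zh = defaultdict(set)
    for zh, translations in zh_to_en.items():
        for en in translations:
            en_to_zh[en].add(zh)
    candidates = defaultdict(set)
    for zh, translations in zh_to_en.items():
        for en in translations:
            # Every other word reachable through the same pivot is a candidate.
            candidates[zh] |= en_to_zh[en] - {zh}
    return candidates


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_paraphrases(word, candidates, vectors, threshold=0.5):
    """Keep candidates whose vector similarity reaches the threshold, best first."""
    scored = [
        (cand, cosine(vectors[word], vectors[cand]))
        for cand in candidates.get(word, ())
        if cand in vectors
    ]
    kept = [(cand, score) for cand, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    cands = pivot_candidates(ZH_TO_EN)
    print(rank_paraphrases("河岸", cands, VECTORS))
```

In this toy data the polysemous pivot "bank" links 河岸 to 银行, which is the kind of erroneous candidate the abstract attributes to translation polysemy. The similarity filter drops it because the two context vectors are orthogonal, while 岸边 is kept.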
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1


Article ID: 1926421

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1926421.html
