Research on Word and Phrase Paraphrase Extraction Based on Context Analysis

Published: 2018-05-23 21:12

  Topics: paraphrase extraction + context information. Source: Harbin Institute of Technology, 2017 master's thesis


[Abstract]: In real life, people often describe the same information with different wording; this is the phenomenon of paraphrase. Its existence makes many natural language processing tasks more complex and difficult. Word and phrase paraphrase extraction obtains, from a corpus, words and phrases that express the same meaning. The extracted paraphrase resources have important applications in natural language processing tasks such as question answering, information retrieval, machine translation, and text generation, and can improve the performance of the corresponding systems. This thesis covers three lines of work: word-level paraphrase extraction based on context analysis, phrase-level paraphrase extraction based on the pivot method, and phrase-level paraphrase extraction based on context analysis.

First, the thesis proposes a word paraphrase extraction method based on context analysis. Current work on word paraphrase extraction mainly applies the pivot method to bilingual parallel corpora. This thesis keeps the idea of the pivot method but extracts candidate word paraphrases from online translation resources for Chinese words, which avoids the incorrect paraphrases caused by alignment errors in bilingual parallel corpora. Word vectors are learned from word contexts. The paraphrase score learned by a feedforward neural network is combined with the similarity score between word vectors to form the final score of a word paraphrase, and this final score is used to rank and filter the word paraphrase resources. Filtering with context and related information reduces the incorrect paraphrases caused by polysemy in the foreign-language translations. Manual evaluation shows that the word paraphrase resources extracted by this method are of higher quality than those extracted by the traditional pivot method.

Second, the thesis builds on the commonly used pivot method for phrase paraphrase extraction. That method extracts incorrect phrase paraphrases because of bilingual alignment errors and polysemy in the foreign-language translations, so the extracted phrase paraphrase resources are filtered using translation probability and using context information, as two separate filters. The experimental results show that filtering candidate phrase paraphrases with context information greatly improves the quality of the extracted resources.

Finally, the thesis proposes a phrase paraphrase extraction method based on context analysis. The method uses a two-layer BiLSTM-CRF model to segment a Chinese monolingual corpus into phrases, then uses a deep learning model to learn vector representations of the phrases, and extracts phrases whose vectors have high cosine similarity as candidate phrase paraphrases. These candidates are filtered using the English translations of their words. The thesis also proposes a method for learning phrase context vectors and ranks the candidate phrase paraphrases by the similarity of these vectors. The experimental results show that the neural network model can learn semantic vector representations of phrases, and that the phrase paraphrase resources obtained after filtering and ranking are of much higher quality than those extracted by the pivot method.
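The pivot idea and the context-based filtering described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: the function names `pivot_candidates` and `rank_by_context`, the toy translation dictionary, the two-dimensional toy vectors, and the similarity threshold of 0.5 are all hypothetical. The sketch also leaves out the feedforward network score that the thesis combines with vector similarity, since the abstract does not say how the two scores are combined.

```python
import math
from collections import defaultdict

# Hypothetical toy data: Chinese words (in pinyin) mapped to English translations.
ZH_TO_EN = {
    "yinhang": ["bank"],               # bank (financial institution)
    "qianzhuang": ["bank"],            # old-style private bank
    "hean": ["bank", "riverside"],     # river bank
}

# Hypothetical context vectors; real ones would be learned from word contexts.
VECTORS = {
    "yinhang": [1.0, 0.0],
    "qianzhuang": [0.9, 0.1],
    "hean": [0.0, 1.0],
}


def pivot_candidates(zh_to_en):
    """Pair up source words that share at least one pivot translation."""
    en_to_zh = defaultdict(set)
    for word, translations in zh_to_en.items():
        for translation in translations:
            en_to_zh[translation].add(word)
    candidates = defaultdict(set)
    for words in en_to_zh.values():
        for word in words:
            candidates[word] |= words - {word}
    return candidates


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_context(word, candidates, vectors, threshold=0.5):
    """Drop candidates below the similarity threshold, then sort best first."""
    scored = [(c, cosine(vectors[word], vectors[c])) for c in candidates[word]]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


CANDIDATES = pivot_candidates(ZH_TO_EN)
RANKED = rank_by_context("yinhang", CANDIDATES, VECTORS)
```

In this toy setup the polysemous pivot "bank" links "yinhang" to both "qianzhuang" and "hean"; the context vectors are what separates the financial sense from the river sense, which is the kind of pivot error the abstract says context filtering reduces.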
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1



