Research on Methods for Handling Out-of-Vocabulary Words in Neural Machine Translation
Published: 2019-05-24 04:13
[Abstract]: Neural machine translation (NMT) is a recent machine translation model built on an encoder-decoder network framework, and it has substantially outperformed traditional methods across a range of translation tasks. Because of limits on GPU memory and computation time, NMT can only maintain a relatively small vocabulary of the most frequent words; out-of-vocabulary (OOV) words are usually represented by a single symbol, unk. An unk in the source sentence increases translation ambiguity, and NMT itself cannot resolve an unk in its output, so it has to rely on an additional post-processing method. To address the problems caused by OOV words, this thesis divides the NMT translation process into three stages, "preprocessing", "in-model", and "post-processing", and studies OOV handling in each of them.
First, in the "post-processing" stage, the thesis addresses the shortcomings of existing OOV post-processing methods for NMT by proposing a context-based post-processing method. The method constructs multiple OOV candidate words for each unk, extracts contextual features from several perspectives for each candidate, and then uses a pairwise learning-to-rank model to select the most suitable OOV word to replace the unk in the translation output. Experiments show that the method significantly improves OOV recall in the translation output.
Second, in the "preprocessing" stage, the thesis addresses the ambiguity that OOV words introduce in NMT by representing them with semantic units at two granularities: similar words and cluster information. OOV words in the NMT training and test corpora are replaced with these semantic representations, the replaced corpora are used for training and testing, and the replaced items are restored in the translation output once testing is complete. Experiments show that preprocessing OOV words with word classes clearly improves translation quality.
Finally, in the "in-model" stage, the thesis proposes a hierarchical cluster word-vector method for OOV words. Clustering is used to build a hierarchical semantic representation for OOV words, which is embedded into the NMT model. This hierarchy not only disambiguates OOV words on the source side but also lets the model use NMT context information to choose a translation for an unk on the target side. The cluster vectors also alleviate the sparsity of OOV words. Experiments show that the model improves on the baseline by 1.43 to 2.06 BLEU on Chinese-English translation tasks.
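The vocabulary limit and the "preprocessing" idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the `word_to_class` mapping, and the class-token format such as `<C_animal>` are all hypothetical, and the sketch assumes tokenized sentences given as lists of strings.

```python
from collections import Counter

UNK = "unk"  # single symbol standing in for every out-of-vocabulary word


def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens (the limited NMT vocabulary)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def mask_oov(sentence, vocab):
    """Baseline behaviour: every OOV token collapses to the same unk symbol."""
    return [tok if tok in vocab else UNK for tok in sentence]


def replace_oov_with_class(sentence, vocab, word_to_class):
    """Preprocessing variant: map each OOV token to a coarser class token.

    word_to_class is a hypothetical mapping (for example, from word clustering).
    Returns the rewritten sentence plus (position, original token) pairs, so the
    original words can be looked up again when restoring the translation output.
    """
    rewritten, replaced = [], []
    for i, tok in enumerate(sentence):
        if tok in vocab:
            rewritten.append(tok)
        else:
            rewritten.append(word_to_class.get(tok, UNK))
            replaced.append((i, tok))
    return rewritten, replaced
```

With class tokens, two different OOV words no longer share one indistinguishable symbol, which is the source-side ambiguity the abstract refers to; words without a class entry still fall back to unk.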
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1
Article ID: 2484549
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2484549.html