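Research on Chinese-Vietnamese Statistical Machine Translation Methods Incorporating Linguistic Differences
Topic: statistical machine translation + Chinese-Vietnamese; Source: master's thesis, Kunming University of Science and Technology, 2017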
【Abstract】: Vietnam is an important Southeast Asian country that borders China and has long maintained frequent political and economic exchanges with it. Machine translation is one of the major branches of natural language processing research, and research on Chinese-Vietnamese statistical machine translation provides important support for Chinese-Vietnamese bilingual understanding, information retrieval, cultural exchange, and economic trade. Chinese-to-Vietnamese translation models are still at an early stage; existing work has concentrated mainly on building bilingual parallel corpora, word alignment methods for Chinese-Vietnamese, and Vietnamese dependency syntax trees. Because only a small amount of Chinese-Vietnamese parallel text is available on the Internet, a translation model trained on such sparse data can hardly cover linguistic knowledge comprehensively. In addition, without guidance from linguistic differences, the translation model and the decoding algorithm depend entirely on the size of the corpus, which raises the probability of introducing errors. How to incorporate linguistic differences into Chinese-Vietnamese translation models is therefore a difficult open problem. Vietnamese and Chinese share some linguistic features and differ in others. Both follow subject-verb-object order. They differ in that Vietnamese modifiers (attributives, adverbials, and so on) are postposed relative to Chinese: in Vietnamese, an adjective follows the noun it modifies, and an adverb follows the adjective or verb it modifies. Based on this analysis, this thesis models and studies the incorporation of linguistic differences in both the hierarchical phrase-based model and the syntactic tree-to-tree model:
(1) A hierarchical phrase-based translation model that incorporates linguistic differences into the lexicalized model. First, the Chinese Academy of Sciences Chinese word segmentation and part-of-speech tagging tool and a Vietnamese word segmentation tool are used to segment and tag the Chinese-Vietnamese parallel sentence pairs, and GIZA++ is used to obtain the bilingual word alignments. The word alignments are then used to extract initial phrase pairs, which are generalized into rules containing nonterminals, and the hierarchical phrase-based translation model is trained. Next, the differences between Chinese and Vietnamese are analyzed, the linguistic characteristics are defined formally, and these definitions are incorporated into the lexicalized reordering model. Decoding uses the CKY algorithm. The experiments compare the hierarchical phrase-based model that incorporates linguistic differences with the conventional hierarchical phrase-based model under language models of different n-gram orders. The results show that the model incorporating linguistic differences improves translation quality.
(2) A Chinese-Vietnamese statistical machine translation method based on a syntactic tree-to-tree translation model that incorporates linguistic characteristics. First, syntactic parsing produces bilingual syntax trees. Word alignments are then obtained with GIZA++, rule pairs are extracted from the one-to-one corresponding syntax trees, and a rule base is built. The rich phrase pairs of the phrase-based translation model are used to generalize the parse trees of the source and target languages, which enlarges the rule base. Effective linguistic difference features are then used to preprocess the rules and to tune the translation model. Decoding uses a tree-parsing algorithm, with generalization on the target-language side guiding the generation of candidate translations. The experiments compare the BLEU scores of the hierarchical phrase-based model incorporating linguistic differences, the syntactic tree-to-tree model, and the tree-to-tree model incorporating linguistic characteristics. The results show that the proposed method enlarges the rule base while also improving translation accuracy.
(3) A prototype system for the Chinese-Vietnamese syntactic tree-to-tree translation model incorporating linguistic differences. Building on the syntactic tree-to-tree translation system, the linguistic differences between Chinese and Vietnamese are incorporated as features into the optimization of the rule base and into the modeling stage of the translation model. The system was built with several open-source tools and frameworks, including the NiuTrans translation framework, the Chinese Academy of Sciences word segmentation and tagging tool, and GIZA++. The front end is built with Java Servlet technology, and submitted sentences are decoded by the translation model, which completes the prototype system for the Chinese-Vietnamese syntactic tree-to-tree translation model incorporating linguistic differences.
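The rule-extraction step described in (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the sentence pair and its word alignment are a hand-written toy example (in practice the alignment would come from GIZA++), the function names `extract_phrase_pairs` and `generalize` are hypothetical, and the extraction uses the standard alignment-consistency criterion in a simplified form that does not extend phrases over unaligned boundary words. The example was chosen so that the generalized rule shows the adjective following the noun on the Vietnamese side.

```python
def extract_phrase_pairs(n_src, alignment, max_len=4):
    """Return spans (i1, i2, j1, j2) of phrase pairs consistent with the alignment.

    A source span [i1, i2] and target span [j1, j2] are consistent when no
    word inside either span is aligned to a word outside the other span.
    Simplification: unaligned boundary words are not used to extend spans.
    """
    pairs = set()
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # Target positions linked to the source span.
            tgt_pos = [t for s, t in alignment if i1 <= s <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 >= max_len:
                continue
            # Every link that touches the target span must start inside the source span.
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.add((i1, i2, j1, j2))
    return pairs


def generalize(src, tgt, outer, inner):
    """Replace the sub-phrase pair `inner` inside `outer` with the nonterminal X1."""
    i1, i2, j1, j2 = outer
    a1, a2, b1, b2 = inner
    assert i1 <= a1 <= a2 <= i2 and j1 <= b1 <= b2 <= j2
    src_side = src[i1:a1] + ["X1"] + src[a2 + 1:i2 + 1]
    tgt_side = tgt[j1:b1] + ["X1"] + tgt[b2 + 1:j2 + 1]
    return " ".join(src_side), " ".join(tgt_side)


# Toy data (an assumption for illustration): "a new book".
src = ["一", "本", "新", "书"]          # one / classifier / new / book
tgt = ["một", "quyển", "sách", "mới"]  # one / classifier / book / new
alignment = {(0, 0), (1, 1), (2, 3), (3, 2)}  # (source index, target index)

pairs = extract_phrase_pairs(len(src), alignment)

# Generalize "新 书" <-> "sách mới" by abstracting away "书" <-> "sách".
rule = generalize(src, tgt, (2, 3, 2, 3), (3, 3, 2, 2))
print(rule)  # ('新 X1', 'X1 mới')
```

The resulting rule places the nonterminal after the adjective on the Chinese side and before it on the Vietnamese side, which is the kind of modifier reordering that the abstract describes as a linguistic difference between the two languages.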
【Degree-granting institution】: Kunming University of Science and Technology
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP391.2
Article ID: 1798227
Article link: http://sikaile.net/jingjilunwen/zhengzhijingjixuelunwen/1798227.html