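Research on Chinese-Lao Bilingual Sentence Alignment Methods
Topic: Chinese-Lao + sentence alignment. Source: Kunming University of Science and Technology, 2017 master's thesis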
【Abstract】: A bilingual corpus stores text resources and information that are semantically equivalent across two languages. It is an important basic resource for bilingual language processing and is widely used in machine translation, cross-language information retrieval, word sense disambiguation, translation knowledge extraction, and related tasks. Alignment is at the core of processing bilingual text, and its quality directly affects downstream natural language processing work. Sentence alignment is text alignment at the sentence level: a technique for finding semantically matching sentence pairs in a bilingual corpus. Based on the linguistic characteristics of Chinese and Lao, this thesis studies how to build a Chinese-Lao parallel corpus, how to select high-quality Chinese-Lao text features, and how to extract Chinese-Lao parallel sentence pairs using multiple features. The main work is as follows.
(1) The thesis studies how to build a bilingual parallel corpus, examines how parallel material is distributed on multilingual platforms (chiefly Wikipedia), and formulates a construction strategy for a Chinese-Lao parallel corpus that covers bilingual corpus crawling, main-text extraction, sentence alignment, and other steps.
(2) It analyzes the linguistic characteristics of Lao, summarizes the similarities and differences between Chinese and Lao syntactic structure, and on that basis selects a set of Chinese-Lao text features, including dictionary matching, word co-occurrence rate, and numeral features, among others, in preparation for parallel sentence pair extraction.
(3) It proposes a multi-feature method for extracting Chinese-Lao parallel sentence pairs. First, the bilingual corpus obtained from multilingual platforms (chiefly Wikipedia) is preprocessed. Next, a candidate sentence pair extraction method produces a set of candidate parallel sentence pairs, and the text features above are combined to train a support vector machine model and a maximum entropy model. Finally, experiments compare the extraction quality of the two classifiers and the effect of each text feature on alignment. The results indicate that the support vector machine is better suited to this method and that the combination of all text features reaches an accuracy of 70.46%, a feasible and effective result for Chinese-Lao parallel sentence pair extraction.
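The feature step in item (3) of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy lexicon, and the exact feature definitions (a dictionary-hit ratio and a Jaccard overlap of numbers) are assumptions made for illustration. Both sentences are assumed to be already word-segmented, and the word co-occurrence rate feature is left out because the abstract does not define it.

```python
import re


def numbers_in(text):
    """Return the set of integers written with decimal digits in `text`.

    In Python 3, `\\d` matches any Unicode decimal digit, so Lao digits are
    found too, and int() converts them to ordinary integers.
    Numbers spelled out in words or in Chinese numerals are not handled.
    """
    return {int(run) for run in re.findall(r"\d+", text)}


def number_feature(zh_tokens, lo_tokens):
    """Jaccard overlap of the numbers in both sentences (assumed definition)."""
    zh_nums = numbers_in(" ".join(zh_tokens))
    lo_nums = numbers_in(" ".join(lo_tokens))
    if not zh_nums and not lo_nums:
        return 1.0  # neither side has numbers, so nothing conflicts
    return len(zh_nums & lo_nums) / len(zh_nums | lo_nums)


def dictionary_match_feature(zh_tokens, lo_tokens, lexicon):
    """Share of Chinese tokens with a lexicon translation in the Lao sentence."""
    if not zh_tokens:
        return 0.0
    lo_set = set(lo_tokens)
    hits = sum(1 for tok in zh_tokens if lexicon.get(tok, set()) & lo_set)
    return hits / len(zh_tokens)


def feature_vector(zh_tokens, lo_tokens, lexicon):
    """Feature vector for one candidate sentence pair."""
    return [
        dictionary_match_feature(zh_tokens, lo_tokens, lexicon),
        number_feature(zh_tokens, lo_tokens),
    ]


# Hypothetical toy lexicon and sentence pair, for illustration only.
toy_lexicon = {"老挝": {"ລາວ"}, "首都": {"ນະຄອນຫຼວງ"}, "万象": {"ວຽງຈັນ"}}
zh = ["老挝", "首都", "是", "万象", "2017"]
lo = ["ນະຄອນຫຼວງ", "ຂອງ", "ລາວ", "ແມ່ນ", "ວຽງຈັນ", "໒໐໑໗"]
print(feature_vector(zh, lo, toy_lexicon))
```

Vectors of this kind, computed for every candidate pair and labeled as parallel or not parallel, are the sort of input a binary classifier such as a support vector machine or a maximum entropy model would be trained on.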
【Degree-granting institution】: Kunming University of Science and Technology
【Degree level】: Master's
【Year degree awarded】: 2017
【Classification number】: TP391.1
Article ID: 1900600
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1900600.html