Research on Web-Based Methods for Constructing Large-Scale Parallel Corpora
Published: 2018-05-28 12:54
Topics: Web information mining; bilingual parallel corpus. Source: Master's thesis, Soochow University, 2012.
Abstract: Large-scale parallel corpora are an important resource for natural-language-processing applications such as machine translation and cross-language information retrieval. The Internet holds a vast amount of multilingual parallel material, and earlier work focused on mining pairs of parallel (i.e., mutually translated) monolingual web pages from multilingual sites and extracting parallel text from them. Although many institutions have undertaken the construction of bilingual parallel corpora, existing corpora still fall short of what real-text processing requires in size, quality, and domain coverage. Researchers have since observed that bilingual parallel material on the Web exists not only in pairs of parallel monolingual pages but also inside bilingual mixed pages, and that the material found inside bilingual mixed pages is better translated, larger in scale, and broader in domain coverage. This thesis therefore focuses on bilingual mixed pages and studies how to build a large-scale bilingual parallel corpus automatically. The main contributions are as follows:

1. Web-based acquisition of bilingual mixed pages

The Web indexes an enormous number of pages, so accurately retrieving bilingual mixed pages is a challenging task. Previous work restricted the target sources: a large set of source sites (e.g., English-learning or translation sites) was collected in advance, and all of their internal pages were downloaded recursively as candidate bilingual mixed pages. However, selecting the source sites requires manual intervention, and the number of pages obtained is limited. To overcome these drawbacks, other studies used search engines and heuristic information to select candidate source sites automatically, but the candidates were of uneven quality and brought in many noisy pages. This thesis proposes a method that uses a search engine together with a small parallel corpus already in hand to discover and fetch bilingual mixed pages recursively. Experiments show that the method acquires high-quality bilingual mixed pages quickly, accurately, and sustainably.

2. Improved extraction and alignment of bilingual parallel material

Besides useful parallel material, bilingual mixed pages contain noise such as advertisements and navigation elements, and the parallel material appears in many different layouts, all of which complicate extraction. Moreover, the vocabulary of the parallel material far exceeds the coverage of bilingual dictionaries, which makes alignment harder. This thesis extracts parallel material by automatically learning the layouts in which it appears on a page, and improves corpus quality with length-based, bilingual-dictionary, and translation-model methods.
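The abstract does not reproduce the thesis's criteria for keeping a candidate page as "bilingual mixed". The sketch below is a minimal illustrative heuristic, not the thesis's method; the function name, character ranges, and the 15% threshold are all assumptions: a page's text is kept when both Chinese (CJK) and English (Latin) characters each make up a sizable share of its letters.

```python
# Illustrative filter for candidate "bilingual mixed" pages.
# Assumption: a page qualifies when both CJK and Latin letters each
# account for at least `min_share` of the letter characters.

def is_bilingual_mixed(text, min_share=0.15):
    """Return True if the text mixes Chinese and English substantially."""
    # Count CJK Unified Ideographs (U+4E00..U+9FFF) and ASCII letters.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    letters = cjk + latin
    if letters == 0:
        return False
    return cjk / letters >= min_share and latin / letters >= min_share

if __name__ == "__main__":
    print(is_bilingual_mixed("Long time no see. 好久不见。"))   # → True
    print(is_bilingual_mixed("Purely English page content."))  # → False
```

A real pipeline would apply such a filter to the visible text of each downloaded candidate before attempting extraction, discarding monolingual and near-monolingual pages early.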
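The length-based alignment the abstract mentions is in the spirit of Gale & Church (1991). The sketch below is a simplified dynamic-programming aligner with an assumed cost function (a normalized length difference and a fixed skip penalty), not the thesis's actual formula, and it omits the dictionary and translation-model scores the thesis combines with it.

```python
# Minimal length-based sentence aligner (1-1, 1-0, 0-1 beads only).
# The cost function and skip penalty are illustrative assumptions.

def align_by_length(src, tgt):
    """Align two sentence lists by character length via dynamic programming.

    Returns a list of (src_index or None, tgt_index or None) pairs.
    """
    SKIP = 5.0  # fixed penalty for leaving a sentence unaligned (assumption)
    n, m = len(src), len(tgt)
    INF = float("inf")
    # dp[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead: cost grows with length mismatch
                a, b = len(src[i]), len(tgt[j])
                cost = abs(a - b) / max(a, b, 1)
                if dp[i][j] + cost < dp[i + 1][j + 1]:
                    dp[i + 1][j + 1] = dp[i][j] + cost
                    back[i + 1][j + 1] = (1, 1)
            if i < n and dp[i][j] + SKIP < dp[i + 1][j]:  # 1-0 bead
                dp[i + 1][j] = dp[i][j] + SKIP
                back[i + 1][j] = (1, 0)
            if j < m and dp[i][j] + SKIP < dp[i][j + 1]:  # 0-1 bead
                dp[i][j + 1] = dp[i][j] + SKIP
                back[i][j + 1] = (0, 1)
    # Trace back from (n, m) to recover the bead sequence.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((i - 1 if di else None, j - 1 if dj else None))
        i, j = i - di, j - dj
    return pairs[::-1]

if __name__ == "__main__":
    src = ["Hello world.", "This is a much longer English sentence for testing."]
    tgt = ["你好,世界。", "这是一个用来测试的长得多的英文句子的译文。"]
    print(align_by_length(src, tgt))  # → [(0, 0), (1, 1)]
```

In practice a length-only score is a weak signal on its own; the thesis's approach of combining it with bilingual-dictionary and translation-model evidence would replace the simple cost above with a weighted combination of scores.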
Degree-granting institution: Soochow University
Degree level: Master's
Year conferred: 2012
Classification number: TP393.09
【引證文獻(xiàn)】
相關(guān)碩士學(xué)位論文 前1條
1 王敏;關(guān)于高校網(wǎng)絡(luò)開放課程發(fā)展現(xiàn)狀的調(diào)研報(bào)告[D];上海師范大學(xué);2013年
Article ID: 1946771
Link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1946771.html