Research on Uygur-Chinese Translation Rules and Resource Construction Based on Named Entities
Published: 2018-07-31 11:10
[Abstract]: As education spreads in the ethnic minority regions of Xinjiang and the educational level of the population gradually rises, the demand of Xinjiang's ethnic minorities for information media grows daily, and the number of websites published in Uygur script is increasing year by year. Xinjiang news websites mainly carry reports on public affairs such as politics, the economy, military affairs and diplomacy, together with reports and commentary on social emergencies. Xinjiang's bilingual news media (including government documents of various kinds) are reported to translate numbers with low accuracy in content involving finance, dates and times. Given the mass of available information, obtaining accurate data is not only a problem for researchers to solve but also a need of government staff and of anyone looking up information. Correct translation of numeric phrases in web news data and government documents is an important part of statistical machine translation. Taking this as its starting point, the main research work of this thesis is as follows:
First, this thesis collects the Uygur-Chinese bilingual parallel corpus needed for the experiments, then cleans and processes it. The corpus is mainly downloaded from Xinjiang news websites.
Second, named entities such as numbers, times and dates are classified in detail. The categories are derived from an analysis of how number and time expressions are formed in Uygur and in Chinese.
Third, Uygur-Chinese numeral recognition and translation rules are written by hand. Writing rules for the number, time and date expressions that occur in the corpus is the core of this thesis.
The novelty of this thesis is as follows. Influential online translation systems such as Baidu, Google and Youdao are already available in China, but they only translate between major languages. They do not support translation between minority languages and other languages, let alone the translation of Uygur numeric phrases into Chinese. This thesis uses a rule-based method to translate number and time expressions from Uygur into Chinese.
The experimental results show that hand-written rules for named entities such as numbers and times effectively improve the phrase translation probability table and thereby clearly improve translation quality. Future work will study how to make better use of rule-based methods within statistical machine translation, and will refine and extend the rules.
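The rule-writing step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual rule set: the Latin-script transliteration of Uygur, the numeral lexicon, the date pattern and the function names `numeral_to_int` and `translate_dates` are all assumptions made for illustration only.

```python
import re

# Hypothetical lexicon in Latin-script Uygur; illustrative, not from the thesis.
UNITS = {"bir": 1, "ikki": 2, "üch": 3, "töt": 4, "besh": 5,
         "alte": 6, "yette": 7, "sekkiz": 8, "toqquz": 9}
TENS = {"on": 10, "yigirme": 20, "ottuz": 30, "qiriq": 40, "ellik": 50,
        "atmish": 60, "yetmish": 70, "seksen": 80, "toqsan": 90}


def numeral_to_int(phrase):
    """Convert a spelled-out numeral (below one million) to an integer."""
    total = 0    # value of completed thousands groups
    current = 0  # value of the group being read
    for word in phrase.lower().split():
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "yüz":   # hundred: multiplies the preceding unit
            current = max(current, 1) * 100
        elif word == "ming":  # thousand: closes the current group
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a known numeral word: {word!r}")
    return total + current


# Assumed surface form of a digit-based date: "<year>-yili <month>-ayning <day>-küni".
DATE_RULE = re.compile(r"(\d{4})-yili\s+(\d{1,2})-ayning\s+(\d{1,2})-küni")


def translate_dates(text):
    """Rewrite every matching date expression into the Chinese year/month/day form."""
    return DATE_RULE.sub(r"\1年\2月\3日", text)


if __name__ == "__main__":
    print(numeral_to_int("ikki ming on üch"))
    print(translate_dates("2013-yili 7-ayning 31-küni"))
```

Each rule pairs a recognition pattern on the source side with a fixed rendering on the target side, which is the general shape of a hand-written translation rule. A real system would work on Arabic-script Uygur text and would need many more categories (times, percentages, currency amounts).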
[Degree-granting institution]: Northwest Minzu University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: H215; H085
Article ID: 2155399
Article link: http://sikaile.net/wenyilunwen/yuyanxuelw/2155399.html