Research on the Construction and Evaluation of Comparable Corpora for Specialized Domains
Published: 2019-01-02 15:28
[Abstract]: Multilingual resources such as bilingual dictionaries and parallel corpora are the foundation for overcoming cross-language barriers and for multilingual information processing and services. In some domains and languages, however, these resources are scarce, which creates an acquisition bottleneck. Comparable corpora, by contrast, do not share the limitation of parallel corpora that the translation is constrained by the source text. They are easy to obtain, and bilingual word pairs extracted from them can be used to expand bilingual dictionaries. Research on building comparable corpora is therefore worthwhile: it enriches the theory of corpus construction, and it supplies multilingual information processing with abundant, usable multilingual corpus resources. Existing comparable corpus construction mainly targets general domains such as news, yet practical applications also urgently need comparable corpora for specialized domains. Because specialized-domain and general-domain corpora differ in many respects, the construction and evaluation methods developed for general domains do not necessarily carry over to specialized domains. This thesis therefore studies the construction and evaluation of specialized-domain comparable corpora: it explores methods for collecting Chinese-English domain comparable corpora, studies the measurement of corpus comparability by introducing a topic dimension on top of cross-language similarity, and finally assesses corpus quality through both internal and external evaluation.

For corpus collection, the thesis uses three types of Internet resources as data sources (Web search engines, online encyclopedias, and Chinese and English academic databases) to build specialized-domain comparable corpora, and compares and analyzes these methods.

For comparability measurement, the thesis takes words as the unit and measures the comparability of a corpus as a whole. It runs experiments on different types of corpora (parallel, comparable, non-comparable, and others) with three methods: sequence similarity based on traditional statistics (including the chi-square statistic and the Spearman coefficient), sequence similarity based on word-frequency ranking, and sequence similarity based on termhood ranking. The results show that the termhood-ranking method performs best, followed by the word-frequency method, while the traditional-statistics method performs worst.

In addition, most research on comparable corpora relies on a single indicator, and no complete, unified evaluation system has yet emerged, so the evaluation of comparable corpora needs deeper study. The thesis therefore evaluates the corpora from two sides. Internal evaluation assesses the internal consistency of a corpus on the basis of the overall characteristics of its words and the similarity among its sub-corpora. External evaluation judges corpus quality indirectly through a bilingual term extraction task. Bilingual term extraction experiments on corpora of differing comparability (parallel, comparable, and non-comparable) show that corpora with higher comparability yield higher-quality terms.
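The abstract does not give the formulas behind its rank-based comparability measures, so the following is only a minimal sketch of one plausible reading of the word-frequency-ranking idea: translate the source-side words with a bilingual dictionary, rank the shared vocabulary by frequency on each side, and compare the two rankings with the Spearman coefficient. The function names, the toy dictionary, and the restriction to shared vocabulary are all assumptions made for illustration, not the thesis's actual method.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values in descending order (1 = largest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_comparability(source_tokens, target_tokens, dictionary):
    """Spearman correlation of frequency ranks over the shared vocabulary.

    `dictionary` is a hypothetical one-to-one source-to-target word map;
    source words missing from it are ignored.
    """
    translated = [dictionary[t] for t in source_tokens if t in dictionary]
    src_freq = Counter(translated)
    tgt_freq = Counter(target_tokens)
    shared = sorted(set(src_freq) & set(tgt_freq))
    if len(shared) < 2:
        return 0.0  # too little overlap to compare rankings
    src_ranks = average_ranks([src_freq[w] for w in shared])
    tgt_ranks = average_ranks([tgt_freq[w] for w in shared])
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(src_ranks, tgt_ranks)


# Toy data: three Chinese terms and their assumed English translations.
toy_dictionary = {"语料": "corpus", "术语": "term", "词典": "dictionary"}
zh_tokens = ["语料"] * 3 + ["术语"] * 2 + ["词典"]
en_same_order = ["corpus"] * 5 + ["term"] * 3 + ["dictionary"] + ["the"]
en_reversed = ["corpus"] + ["term"] * 2 + ["dictionary"] * 3

print(rank_comparability(zh_tokens, en_same_order, toy_dictionary))
print(rank_comparability(zh_tokens, en_reversed, toy_dictionary))
```

Under this reading, a score near 1 means the two sides rank their shared words in the same order, and a score near -1 means the order is reversed. A termhood-based variant would replace the raw frequencies with a termhood score before ranking, but the abstract does not specify how that score is computed.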
[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1
Article ID: 2398671
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2398671.html