Research on the Construction and Evaluation of Comparable Corpora for Specialized Domains

Published: 2019-01-02 15:28

[Abstract]: Multilingual resources such as bilingual dictionaries and parallel corpora are the foundation for overcoming cross-language barriers and for multilingual information processing and services. In some domains and languages, however, these resources are scarce, which creates an acquisition bottleneck. Comparable corpora, by contrast, do not share the drawback of parallel corpora, in which the translation is constrained by the source text. They are easy to obtain, and the bilingual word pairs extracted from them can be used to expand bilingual dictionaries. Research on constructing comparable corpora is therefore worthwhile: it can enrich the theory of corpus construction, and it can supply rich, usable multilingual corpus resources for multilingual information processing.

Existing work on building comparable corpora mainly targets general domains such as news, yet practical applications also urgently need comparable corpora for specialized domains. Because specialized-domain and general-domain corpora differ in many respects, the construction and evaluation methods developed for general domains do not necessarily carry over to specialized domains. This thesis therefore studies the construction and evaluation of comparable corpora for specialized domains. It explores methods for collecting Chinese-English domain comparable corpora, measures corpus comparability by introducing a topic dimension on top of cross-language similarity, and finally assesses corpus quality through both internal and external evaluation.

For corpus collection, the thesis builds specialized-domain comparable corpora from three types of Internet resources (Web search engines, online encyclopedias, and Chinese and English academic databases) and compares the resulting methods.

For comparability measurement, the thesis takes the word as the unit and measures the comparability of a corpus as a whole using three methods: sequence similarity based on traditional statistics (including the chi-square statistic and the Spearman coefficient), sequence similarity based on word-frequency ranking, and sequence similarity based on termhood ranking. Experiments were run on different types of corpora (parallel, comparable, non-comparable, and so on). The results show that the termhood-ranking method performs best, followed by the word-frequency method, while the traditional statistical methods perform worst.

In addition, most research on comparable corpora relies on a single indicator, and no complete, unified evaluation framework has yet emerged, so the evaluation of comparable corpora needs deeper study. The thesis therefore evaluates the corpora from two angles. Internal evaluation assesses the internal consistency of a corpus based on the overall characteristics of its vocabulary and the similarity among its sub-corpora. External evaluation assesses corpus quality indirectly through a bilingual term extraction task. Bilingual term extraction experiments on corpora of differing comparability (parallel, comparable, and non-comparable) show that corpora with higher comparability yield higher-quality terms.
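The abstract names the Spearman coefficient as one of the traditional statistical measures of comparability but does not say how it is computed. The sketch below is one plausible reading, not the thesis's implementation. It assumes both corpora are already tokenized and that a small seed bilingual dictionary is available, then correlates the frequency ranks of the dictionary pairs found in both corpora. The names `average_ranks`, `pearson`, `spearman_comparability`, and `seed_dict` are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in descending order (rank 1 = largest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman_comparability(zh_tokens, en_tokens, seed_dict):
    """Spearman correlation of frequency ranks over dictionary pairs seen in both corpora.

    seed_dict maps a Chinese word to one English translation (an assumption of
    this sketch; a real dictionary is usually one-to-many).
    """
    zh_freq, en_freq = Counter(zh_tokens), Counter(en_tokens)
    pairs = [(zh, en) for zh, en in seed_dict.items()
             if zh in zh_freq and en in en_freq]
    if len(pairs) < 2:
        return 0.0  # too few shared entries to correlate
    zh_ranks = average_ranks([zh_freq[zh] for zh, _ in pairs])
    en_ranks = average_ranks([en_freq[en] for _, en in pairs])
    # Spearman's coefficient is the Pearson correlation of the rank vectors.
    return pearson(zh_ranks, en_ranks)


if __name__ == "__main__":
    seed = {"语料": "corpus", "词典": "dictionary", "术语": "term"}
    zh = ["语料"] * 3 + ["词典"] * 2 + ["术语"]
    en = ["corpus"] * 3 + ["dictionary"] * 2 + ["term"]
    print(spearman_comparability(zh, en, seed))
```

Under this reading, a score near 1 means translation-equivalent words occupy similar frequency ranks in the two corpora, and a score near -1 means the rankings are reversed. The word-frequency and termhood variants described in the abstract would change what is ranked, not how the ranks are correlated.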
[Degree-granting institution]: Nanjing University of Science and Technology (南京理工大學)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1


Article ID: 2398671


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2398671.html
