Research on Methods for Constructing a Chinese-Burmese Bilingual Topic Model Fusing Multiple Features

Published: 2018-12-29 18:33

[Abstract]: Chinese-Burmese parallel corpora are a foundational resource for research on Chinese-Burmese machine translation, cross-language retrieval, parallel sentence pair extraction, and bilingual entity extraction. As a basic model for multilingual document analysis, a cross-language topic model can compute the relatedness of documents in different languages at the semantic level, which supports both the acquisition of Chinese-Burmese comparable documents and the construction of parallel corpora. Studying how to build a Chinese-Burmese bilingual topic model is therefore important for acquiring Chinese-Burmese comparable documents. Starting from corpus construction, and with the aim of obtaining comparable corpora through topic models, this thesis studies the construction of bilingual topic models. The main contributions are as follows.

(1) Construction of a Chinese-Burmese bilingual parallel corpus. Chinese-Burmese bilingual text resources are scarce, and no public, authoritative Chinese-Burmese text corpus exists in China or abroad. Building a Chinese-Burmese bilingual topic model requires a certain amount of bilingual parallel documents as a training set, and the quality of those documents affects the subsequent topic-model research. The thesis describes in detail how Chinese-Burmese bilingual texts were obtained from web pages, electronic magazines, and the WeChat platform. For web text, it details the automatic acquisition process using crawler technology; for electronic magazines and the WeChat platform, it explains the manual collection process. Finally, the resources are merged into a Chinese-Burmese bilingual parallel corpus, and the corresponding data storage method is described.

(2) A Chinese-Burmese bilingual topic model that fuses context features. The model builds on the bilingual LDA topic model and incorporates the contextual features of the text. Bilingual LDA exploits the association between parallel texts, namely that parallel texts share the same text-topic distribution matrix, while fusing context features addresses the fact that the bag-of-words model ignores text structure. In essence, the fused model reduces the negative influence of high-frequency words on the text-topic distribution. Experimental results show that the proposed model with context features yields better text-topic distributions.

(3) A Chinese-Burmese bilingual topic model that fuses semantic expansion. Building on the topic model with context features, the thesis further incorporates a Chinese-Burmese semantic expansion dictionary. The dictionary is parsed and processed to build a Chinese-Burmese semantic expansion set. Words are weighted by their context features, a threshold is set, and for words whose weight exceeds the threshold the corresponding Burmese text is extended using the expansion set. This semantic expansion addresses the problem that a single word can have multiple expressions in Burmese. Context features and semantic expansion features are fused into the same bilingual LDA model, and comparative analysis of the experimental results shows that the resulting multi-feature bilingual topic model performs better than the models used for comparison.
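The threshold-based semantic expansion step in contribution (3) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis's implementation: the function name `expand_burmese_text`, the data layout (a weight per Chinese word, a dictionary mapping each Chinese word to Burmese variants), the choice to skip variants already present, and the placeholder tokens such as `my_economy_v2` are all hypothetical. How the context-feature weights themselves are computed is not specified in the abstract, so they are taken as given input here.

```python
from typing import Dict, List


def expand_burmese_text(
    zh_tokens: List[str],
    my_tokens: List[str],
    weights: Dict[str, float],
    expansion_set: Dict[str, List[str]],
    threshold: float,
) -> List[str]:
    """Extend the Burmese side of a parallel document pair.

    For every Chinese word whose (assumed, precomputed) context weight
    exceeds the threshold, append the Burmese variants listed for it in
    the expansion set, skipping variants that are already present.
    """
    expanded = list(my_tokens)  # copy so the input list is not mutated
    seen = set(my_tokens)
    for word in zh_tokens:
        # Words at or below the threshold are left unexpanded.
        if weights.get(word, 0.0) <= threshold:
            continue
        for variant in expansion_set.get(word, []):
            if variant not in seen:
                expanded.append(variant)
                seen.add(variant)
    return expanded


# Hypothetical toy data: Burmese tokens are placeholders, not real words.
zh_doc = ["经济", "的"]
my_doc = ["my_economy_v1"]
context_weights = {"经济": 0.8, "的": 0.1}
dictionary = {
    "经济": ["my_economy_v1", "my_economy_v2"],
    "的": ["my_of"],
}

result = expand_burmese_text(zh_doc, my_doc, context_weights, dictionary, 0.5)
print(result)
```

In this sketch the low-weight function word is never expanded, which mirrors the abstract's point that the weighting reduces the influence of high-frequency words. The expanded Burmese token list would then be passed to the bilingual LDA model in place of the original text.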
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1


Article ID: 2395219


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2395219.html
