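Research on Cross-Language Document Learning-to-Rank Methods Based on Bilingual Document Similarity
Published: 2018-03-30 22:33
Topic: information retrieval. Entry point: bilingual document similarity. Source: Kunming University of Science and Technology (昆明理工大学), master's thesis, 2017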
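[Abstract]: Cross-language information retrieval is an active research topic and plays an important role in areas such as cross-language document analysis and cross-language news acquisition. Current research on cross-language information retrieval concentrates on query-translation and document-translation methods, which depend heavily on statistical machine translation and therefore suffer from scarce training corpora and low translation accuracy. Meanwhile, learning-to-rank research in information retrieval focuses on monolingual document ranking, and cross-language learning to rank has received little attention. This thesis proposes a cross-language document learning-to-rank model based on bilingual document similarity: a ranking function is trained with machine learning, and the similarity of bilingual documents is incorporated to rank cross-language documents. In building the model, the thesis addresses two main problems. 1. A method for computing the similarity between bilingual documents. Because documents in different languages are hard to represent in a unified space, a similarity method based on bilingual word embeddings is proposed: keywords are first extracted from the bilingual documents, the keywords are then mapped into a shared semantic space, and the distances between these keywords are used to represent the similarity between the documents. Experimental results show that the proposed method computes the similarity between bilingual documents well. 2. A cross-language document learning-to-rank model based on bilingual document similarity. Because pointwise and pairwise loss functions cannot accurately represent the ranking loss, the thesis builds the model from a listwise loss function (the cross entropy between probability distributions) and a ranking function based on an artificial neural network. It proposes fusing bilingual document similarity as a feature so that cross-language documents are ranked together, with the similarity serving as the basis for scoring documents in the target language. Experimental results show that the proposed model ranks well on both the English-Chinese and English-Vietnamese corpora.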
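The similarity computation described in the abstract (keyword extraction, mapping into a shared semantic space, keyword distances) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how keyword distances are aggregated, so the symmetric best-match cosine average used here is an assumption, and the embedding table, its 3-dimensional vectors, and the function names are all hypothetical. Keyword extraction is assumed to have happened already.

```python
import math

# Hypothetical toy table standing in for trained bilingual word embeddings.
# "jiansuo", "paixu", and "tianqi" are placeholders for Chinese keywords
# (retrieval, ranking, weather) that share one space with the English words.
SHARED_EMBEDDINGS = {
    "retrieval": [0.9, 0.1, 0.0],
    "ranking": [0.1, 0.9, 0.0],
    "jiansuo": [0.8, 0.2, 0.0],
    "paixu": [0.2, 0.8, 0.1],
    "tianqi": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def directed_similarity(src_keywords, tgt_keywords, emb):
    """Average, over source keywords, of the best cosine match in the target."""
    src = [emb[w] for w in src_keywords if w in emb]
    tgt = [emb[w] for w in tgt_keywords if w in emb]
    if not src or not tgt:
        return 0.0
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)


def document_similarity(keywords_a, keywords_b, emb=SHARED_EMBEDDINGS):
    """Symmetric score: mean of the two directed similarities (an assumption)."""
    return 0.5 * (directed_similarity(keywords_a, keywords_b, emb)
                  + directed_similarity(keywords_b, keywords_a, emb))
```

With these toy vectors, an English document with keywords `["retrieval", "ranking"]` scores higher against `["jiansuo", "paixu"]` than against `["tianqi"]`, which is the behavior one would want from a shared-space similarity.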
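A listwise cross-entropy loss of the kind the abstract mentions could look roughly like the sketch below. It assumes the top-one probability formulation (a softmax over the scores of one query's candidate list), which the abstract does not confirm. For brevity the scorer is linear rather than the artificial neural network used in the thesis, and the feature layout (one monolingual relevance feature plus the bilingual similarity) and all names are hypothetical.

```python
import math


def softmax(values):
    """Turn a list of scores into a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(values):
    """Numerically stable log of the softmax probabilities."""
    m = max(values)
    log_total = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_total for v in values]


def listwise_cross_entropy(scores, relevance):
    """Cross entropy between the label-induced and score-induced distributions."""
    target = softmax(relevance)
    log_predicted = log_softmax(scores)
    return -sum(p * lq for p, lq in zip(target, log_predicted))


def score(weights, features):
    """Linear stand-in for the ranking function (the thesis uses a neural net)."""
    return sum(w * x for w, x in zip(weights, features))


def train_step(weights, doc_features, relevance, lr=0.1):
    """One gradient-descent step on the listwise loss for a single query."""
    scores = [score(weights, f) for f in doc_features]
    target = softmax(relevance)
    predicted = softmax(scores)
    grad = [0.0] * len(weights)
    # d(loss)/d(score_j) = predicted_j - target_j; chain through the linear scorer.
    for f, q, p in zip(doc_features, predicted, target):
        for k, x in enumerate(f):
            grad[k] += (q - p) * x
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical candidate list for one query:
# each row is [monolingual relevance feature, bilingual document similarity].
features = [[0.6, 0.9], [0.5, 0.5], [0.4, 0.1]]
relevance = [2.0, 1.0, 0.0]
weights = [0.0, 0.0]
for _ in range(200):
    weights = train_step(weights, features, relevance)
```

Because the loss is defined over the whole candidate list, a single step adjusts the scores of all documents for a query jointly, which is the property that distinguishes it from pointwise and pairwise losses.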
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3
Article ID: 1687979
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1687979.html