Research on an LDA-Based Mongolian Information Retrieval Method and Its System Implementation
Published: 2018-05-09 22:33
Topics: Mongolian + LDA topic model; Source: Inner Mongolia Normal University, master's thesis, 2016
[Abstract]: The continuing development of network technology and the globalization of information let us obtain the information we need from the Internet anytime and anywhere. This convenience has also promoted the online use of minority languages and scripts, helping them adapt to the information age and encouraging the development of search engines for them. Mongolian is one of the more influential minority languages in China. As Mongolian content on the Internet has grown in recent years, quickly and accurately finding the Mongolian information that satisfies a user's need within a large volume of online resources has become an urgent problem for Mongolian information retrieval. Traditional Mongolian retrieval systems mostly rely on keyword matching: they consider only literal matches between words and make little use of the semantic relations between them. In fact, the probability that different users choose the same keyword to describe the same object is often below 20%. Mongolian is also expressed in varied forms, and polysemy (one word, several meanings) and synonymy (several words, one meaning) are common, so query results often differ considerably from what the user needs and retrieval quality suffers.

To address these problems, this thesis looks for a solution in the topic-level semantics of documents. It uses the LDA topic model to extract the latent topics in documents and the co-occurrence relations among topics, and applies this latent topic information to retrieval in order to improve results. Specifically, the thesis proposes a Mongolian information retrieval method that combines the LDA topic model with a language model. The method first builds unigram and bigram language models over the Mongolian text to obtain the language-model probability distribution of each text. It then builds an LDA topic model, estimates its parameters by Gibbs sampling, and derives the latent topic distribution of each document. Finally, it forms a linear combination of the document's topic distribution and its language-model distribution, uses the combined distribution to compute the similarity between each document and the query keywords, and returns the documents most relevant to the topic of the query. The language model makes full use of Mongolian grammatical features, while the LDA topic model offers good topic discovery and generalization, so combining them supports topic-level semantic retrieval of Mongolian documents and improves retrieval accuracy. Experiments were run on a test set drawn from a corpus of primary school Mongolian language textbooks encoded in the international standard encoding. Compared with the traditional keyword-based method and with retrieval that uses the LDA topic model alone, the proposed method improves both precision and recall, which supports its effectiveness and practicality. On this basis, the thesis also designs and implements an information retrieval system for a Mongolian language textbook corpus, aimed at educational use. The system is built on a Java Web framework. It supports full-text search over the corpus as well as database search by fields such as title, edition number, publisher, and education stage. The results page displays text in the traditional Mongolian manner, in vertical columns ordered from left to right, and highlights the matching content.
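The combination step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the form used here, P(w|d) = lam * P_LM(w|d) + (1 - lam) * sum_k P(w|k) P(k|d), is an assumption, as are the Dirichlet smoothing, the parameter names `lam` and `mu`, and every function name. The sketch uses a unigram model only (the thesis also uses a bigram model), English placeholder tokens instead of Mongolian words, and hand-written `theta` and `phi` values that stand in for parameters that would be estimated by Gibbs sampling.

```python
import math
from collections import Counter

# Toy collection; tokens are English placeholders for Mongolian words.
docs = [
    ["horse", "grass", "steppe"],
    ["school", "book", "teacher"],
    ["herd", "horse", "grass"],
]

# Hand-made stand-ins for LDA estimates (assumed, NOT produced by Gibbs sampling here).
# phi[k][w] = P(w | topic k); theta[d][k] = P(topic k | document d).
phi = [
    {"horse": 0.4, "grass": 0.3, "steppe": 0.2, "herd": 0.1},
    {"school": 0.4, "book": 0.3, "teacher": 0.3},
]
theta = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

collection_counts = Counter(w for d in docs for w in d)
collection_len = sum(collection_counts.values())


def lm_prob(word, doc, mu=10.0):
    """Dirichlet-smoothed unigram probability P_LM(word | doc) (assumed smoothing)."""
    p_coll = collection_counts.get(word, 0) / collection_len
    return (doc.count(word) + mu * p_coll) / (len(doc) + mu)


def lda_prob(word, theta_d):
    """Topic-based probability: sum over topics of P(word | k) * P(k | doc)."""
    return sum(p_k * phi[k].get(word, 0.0) for k, p_k in enumerate(theta_d))


def score(query, doc_id, lam=0.7):
    """Query log-likelihood under the linear combination of both distributions."""
    total = 0.0
    for word in query:
        p = lam * lm_prob(word, docs[doc_id]) + (1 - lam) * lda_prob(word, theta[doc_id])
        total += math.log(p) if p > 0 else float("-inf")
    return total


query = ["herd"]
scores = [score(query, d) for d in range(len(docs))]
ranking = sorted(range(len(docs)), key=lambda d: scores[d], reverse=True)
print(ranking)
```

In this toy run, document 2 contains the query word and ranks first. Documents 0 and 1 do not contain it and would tie under the language model alone, but the topic term places document 0 above document 1 because it shares the query word's dominant topic. This is the kind of non-literal match the abstract attributes to the LDA component.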
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.3
Article ID: 1867736
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1867736.html