Research on Tibetan Search and Search Result Clustering, with System Implementation

Published: 2018-07-06 16:45

Topics: Tibetan word segmentation + Tibetan clustering; Source: Southwest Jiaotong University, master's thesis, 2013

[Abstract]: Tibetan has a long history, is the vehicle through which Tibetan culture and civilization are passed down, and is used by more than 6 million people. The body of Tibetan literature is large and wide-ranging. With Windows now supporting Tibetan, Tibetan speakers are increasingly active online. However, there is currently no Tibetan search engine, and the major search engines in China and abroad do not offer Tibetan search, so research on a Tibetan search system is of real significance. This thesis centers on how to build such a system and studies Tibetan word segmentation, Tibetan text collection, text processing, encoding conversion, index-based search, and result clustering, with the aim of implementing a fully functional Tibetan information retrieval system. The main contributions are as follows.

First, an AllCut Tibetan word segmentation algorithm is proposed. Tibetan has no delimiters between words, so segmentation is required. Existing segmentation algorithms are mainly based on statistical probability, part-of-speech tagging, or grammar rules. These algorithms either need large corpora for training or are complex to implement, so under current conditions they are hard to realize or perform poorly. The proposed scheme therefore uses dictionary matching, combined with the grammatical properties of Tibetan (in particular case particles and continuation features) and fine-grained segmentation. It achieves good segmentation results and provides a foundation for the subsequent work.

Second, Tibetan clustering is studied. The thesis first examines text representation and Tibetan stop words in the context of Tibetan clustering: documents are represented with a vector model so that text can be stored and processed effectively by computer, and Tibetan stop words are obtained from statistics over a large number of documents, which removes their interference with clustering. It then systematically studies how partitioning and hierarchical clustering algorithms perform on Tibetan.

Third, Tibetan information retrieval is studied and the system is implemented. This part covers Tibetan web page collection, Tibetan encoding conversion, Tibetan web page preprocessing, and Tibetan text storage, which together address computer processing and retrieval of Tibetan. The search system is then built on Lucene: it automatically discovers and updates Tibetan resources and provides Tibetan search, fulfilling the functions of a Tibetan search engine. Search results are displayed in clusters using the Tibetan clustering work, which improves the relevance and accuracy of the results.
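The abstract does not give the details of AllCut, so the following is only a minimal sketch of the general idea it names: dictionary matching with fine-grained output. It assumes that syllables are delimited by the tsheg mark (U+0F0B) and that clauses end at a shad (U+0F0D). The function names, the `max_syllables` limit, and the three-entry lexicon are hypothetical, and the sketch does not model the case-particle or continuation features the thesis relies on.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark
# Clause boundaries: shad, nyis shad, and whitespace.
_BOUNDARY = re.compile(r"[\u0f0d\u0f0e\s]+")

# Toy lexicon for illustration only (not the thesis's dictionary).
TASHI = "བཀྲ་ཤིས"
DELEK = "བདེ་ལེགས"
LEXICON = {TASHI, DELEK, TASHI + TSHEG + DELEK}
SAMPLE = TASHI + TSHEG + DELEK + "\u0f0d"  # the greeting "tashi delek" plus shad


def split_syllables(clause: str) -> list[str]:
    """Split a clause into syllables on the tsheg mark, dropping empties."""
    return [s for s in clause.split(TSHEG) if s]


def all_cut(clause: str, lexicon: set[str], max_syllables: int = 4) -> list[str]:
    """Emit every lexicon word found in the clause (fine-grained output).

    Syllables that no lexicon word covers are emitted as single-syllable
    tokens so that nothing is lost from the index.
    """
    syllables = split_syllables(clause)
    n = len(syllables)
    spans = []              # (start, end, token) over syllable positions
    covered = [False] * n
    for start in range(n):
        for end in range(start + 1, min(start + max_syllables, n) + 1):
            candidate = TSHEG.join(syllables[start:end])
            if candidate in lexicon:
                spans.append((start, end, candidate))
                covered[start:end] = [True] * (end - start)
    for i, syllable in enumerate(syllables):
        if not covered[i]:
            spans.append((i, i + 1, syllable))
    spans.sort(key=lambda span: (span[0], span[1]))
    return [token for _, _, token in spans]


def segment(text: str, lexicon: set[str], max_syllables: int = 4) -> list[str]:
    """Segment running text clause by clause."""
    tokens = []
    for clause in _BOUNDARY.split(text):
        tokens.extend(all_cut(clause, lexicon, max_syllables))
    return tokens
```

Calling `segment(SAMPLE, LEXICON)` returns the first word, the full compound, and the second word, in that order. Emitting overlapping words like this favors recall in a search index, at the cost of a larger index than a single best segmentation would produce.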
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1


Article ID: 2103468


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2103468.html
