Research on Topic-Focused Collection Methods for Tibetan Web Pages

Published: 2018-04-04 06:17

Topic: Web retrieval　Focus: Tibetan web page collection　Source: Chang'an University, master's thesis, 2012

[Abstract]: Compared with Chinese, Tibetan information processing technology has developed slowly, and search engines that support Tibetan are lacking. As a result, Tibetan information on the Internet is often left in an "isolated state", which makes it hard for users to find and retrieve. A method for collecting Tibetan information from the web is therefore especially important to researchers working on Tibetan. Building on an analysis of the web page collection workflow, the basic working principles of web crawlers, and topic-focused page collection, this thesis studies collection methods for Tibetan web pages in depth:

1. It compares and analyzes the features that distinguish Tibetan web pages from other pages, such as fonts, the Tibetan syllable delimiter (tsheg), and high-frequency Tibetan words, and designs algorithms for identifying Tibetan web pages.
2. It examines the key techniques of a Tibetan topic-focused crawler, including Tibetan word segmentation, topic judgment, and crawl strategy, and proposes a Tibetan topic judgment method based on "guide words".
3. It studies the Heritrix software and, by improving and extending its key modules Extractor and Frontierscheduler, crawls Tibetan topic-specific websites with the "guide word" algorithm. It also extends the Queue-assignment-policy module with a hash algorithm, which greatly improves the crawler's collection efficiency.
4. It uses the HTMLParse software to extract the collected news items and stores each item's title, publication time, source, and body text in a database.
5. It normalizes the encoding of the collected Tibetan web page text by converting it to the international standard Unicode encoding.

Using these results, with web page precision and recall as reference metrics, several thresholds of the "guide word" topic judgment algorithm were tested. Based on the test results, pages were crawled from China Tibet Net, with a crawl accuracy of about 62%. The test data indicate that the approach is effective for topic-focused collection of Tibetan information and has practical and theoretical reference value.
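The first item of the abstract, identifying Tibetan pages from features such as the syllable delimiter, can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithm or thresholds, so the function names (`tibetan_stats`, `looks_tibetan`) and the cut-off values below are assumptions for illustration only. The sketch also assumes the page text has already been decoded to Unicode, where the Tibetan block spans U+0F00 to U+0FFF and the tsheg is U+0F0B.

```python
# Illustrative sketch only: names and thresholds are assumptions,
# not identifiers or values taken from the thesis.

TIBETAN_BLOCK = range(0x0F00, 0x1000)  # Unicode Tibetan block U+0F00..U+0FFF
TSHEG = "\u0f0b"                        # Tibetan syllable delimiter (tsheg)


def tibetan_stats(text):
    """Return (share of non-space characters in the Tibetan block,
    tshegs per Tibetan character)."""
    chars = [c for c in text if not c.isspace()]
    tibetan = [c for c in chars if ord(c) in TIBETAN_BLOCK]
    if not tibetan:
        return 0.0, 0.0
    return len(tibetan) / len(chars), tibetan.count(TSHEG) / len(tibetan)


def looks_tibetan(text, min_ratio=0.5, min_tsheg_density=0.1):
    """Heuristic check: mostly Tibetan characters, with regular tshegs.

    Both thresholds are hypothetical defaults chosen for this example.
    """
    ratio, density = tibetan_stats(text)
    return ratio >= min_ratio and density >= min_tsheg_density


if __name__ == "__main__":
    # Two Tibetan syllables, each followed by a tsheg, written as escapes.
    sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42\u0f0b"
    print(looks_tibetan(sample))
    print(looks_tibetan("This is an English page."))
```

A real classifier along the lines the abstract describes would combine further signals, such as declared fonts and high-frequency Tibetan words, and would first need to handle pages in legacy, non-Unicode Tibetan encodings.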
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.09; TP391.1

