Research on Web Page Deduplication Techniques for Campus Network Search Engines

Published: 2019-06-15 20:14

[Abstract]: With the rapid development of campus networks, the volume of campus information resources has grown quickly, which makes it difficult, time-consuming, and inefficient for teachers and students to locate valuable information. Because of the particular characteristics of campus networks, mature general-purpose search engines are not fully suited to them, and the large number of reprinted web pages produces too many duplicate pages in retrieval results. To address this problem, this thesis analyzes the characteristics of campus web pages and existing deduplication techniques, adopts a strategy that handles different types of duplicate pages separately at indexing time and at real-time retrieval, and builds a campus network search engine. The work consists of the following parts.

First, the preparatory work for web page deduplication is studied and analyzed. The causes, definition, and types of web page noise are analyzed, and a merged-content-block technique is used to remove noise from the original page set and extract the main text of each page. Chinese word segmentation techniques are then studied and compared. The Paoding (庖丁解牛) word segmentation software is finally adopted, and Nutch is extended through secondary development, that is, by modifying the Nutch source code, to support Chinese word segmentation.

Second, the index-time deduplication algorithm is studied and improved. After analyzing existing algorithms, a deduplication algorithm based on longest-paragraph signatures is adopted for fully or partially duplicated pages. Each whole document is first signed and exact duplicates are removed. The remaining documents are then split into paragraphs, the paragraphs are sorted, and the first N paragraphs are fingerprinted to serve as the document's features. When the number of identical paragraphs shared by two documents exceeds a system-defined threshold, the two documents are judged to be duplicates of each other. Extracting only the first N paragraphs and sorting the fingerprints greatly reduces computational complexity. Experiments show that the method achieves high deduplication accuracy.

Third, for duplicates produced when a page is slightly modified during reprinting, an optimized Fourier transform deduplication algorithm is applied at real-time retrieval. The algorithm maps each word of a document to a numerical fingerprint, so that each document can be represented as a discrete numerical sequence. Applying the Fourier transform to this sequence yields Fourier coefficients, and comparing the first several coefficients gives a rough measure of the similarity between two sequences. Experiments show that the algorithm balances recall and deduplication rate when pages have been modified.

Nutch is used as the development tool: the index-time deduplication algorithm is implemented by modifying the Nutch source code, and the retrieval-time algorithm is implemented as a plug-in. The campus network search engine is designed and implemented on top of Nutch, and its development process and methods are described in detail. Finally, the proposed deduplication strategy is evaluated experimentally on a data set of campus web pages crawled with Nutch. The results show that combining the two algorithms improves the precision of search results and the accuracy of deduplication, and that the resulting campus network search engine runs effectively and reliably.
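
The index-time step in the abstract (sign the whole document, then fingerprint the first N sorted paragraphs and compare against a threshold) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names, the use of MD5, splitting paragraphs on newlines, sorting by length (suggested only by the name "longest-paragraph signature"), and the default values of `n` and `threshold` are all assumptions.

```python
import hashlib


def fingerprint(text):
    """Hash a string after removing whitespace (MD5 is an assumed choice)."""
    normalized = "".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def paragraph_signatures(doc, n=3):
    """Fingerprint the n longest non-empty paragraphs; return them sorted."""
    # Assumption: paragraphs are separated by newlines.
    paragraphs = [p for p in doc.split("\n") if p.strip()]
    # Assumption: "sorting" means longest paragraph first.
    longest = sorted(paragraphs, key=len, reverse=True)[:n]
    return sorted(fingerprint(p) for p in longest)


def is_duplicate(doc_a, doc_b, n=3, threshold=1):
    """Return True if the two documents look like duplicates."""
    # Stage 1: identical whole-document signatures mean an exact duplicate.
    if fingerprint(doc_a) == fingerprint(doc_b):
        return True
    # Stage 2: count paragraph fingerprints that both documents share.
    shared = set(paragraph_signatures(doc_a, n)) & set(paragraph_signatures(doc_b, n))
    # "Exceeds the threshold" is read here as strictly greater than.
    return len(shared) > threshold
```

With these defaults, two pages are reported as duplicates when at least two of their three longest paragraphs match exactly. Pages whose paragraphs were lightly edited will not match at this stage, which is the case the retrieval-time algorithm is meant to cover.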
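
The retrieval-time idea in the abstract (map each word to a numerical fingerprint, take the Fourier transform of the resulting sequence, and compare the first several coefficients) can be illustrated with the sketch below. It uses a plain discrete Fourier transform and does not reproduce the thesis's optimization, which the abstract does not describe. The function names, the CRC32 word hash, the comparison of coefficient magnitudes, the normalization by sequence length, and the value of `k` are assumptions made for illustration; the input is assumed to be a list of already segmented words.

```python
import cmath
import math
import zlib


def word_fingerprints(words):
    """Map each word to a number in [0, 1) (CRC32 is an assumed choice)."""
    return [zlib.crc32(w.encode("utf-8")) / 2**32 for w in words]


def leading_dft_magnitudes(seq, k=8):
    """Magnitudes of the first k DFT coefficients, divided by the length."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    magnitudes = []
    for freq in range(k):
        # Direct evaluation of one DFT coefficient: O(n) per coefficient.
        total = sum(
            x * cmath.exp(-2j * cmath.pi * freq * t / n)
            for t, x in enumerate(seq)
        )
        # Dividing by n keeps documents of different lengths comparable.
        magnitudes.append(abs(total) / n)
    return magnitudes


def spectral_distance(words_a, words_b, k=8):
    """Euclidean distance between the leading DFT magnitudes of two documents."""
    a = leading_dft_magnitudes(word_fingerprints(words_a), k)
    b = leading_dft_magnitudes(word_fingerprints(words_b), k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Identical word sequences give a distance of exactly zero. In a deduplication setting, a pair of pages would be treated as near-duplicates when the distance falls below some cutoff, and that cutoff would have to be chosen empirically.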
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
