Research on Web Page Deduplication Technology in Campus Network Search Engines
[Abstract]: With the rapid expansion of campus networks, the volume of campus information resources has grown quickly, making it hard for teachers and students to locate valuable information; searching wastes time and is inefficient. General-purpose search engines are not well suited to the characteristics of campus networks, and the large number of reprinted pages fills retrieval results with duplicates. After analyzing the characteristics of campus web pages and existing deduplication techniques, this thesis builds a campus network search engine that handles different types of duplicate pages by deduplicating both at index time and at real-time retrieval time. The work consists of the following parts. First, web page preprocessing is studied and analyzed. The causes, definition, and types of web page noise are examined, and a merged-content-block technique is used to remove noise from the original page set and extract the body text of each page. Existing Chinese word segmentation techniques are then studied and compared, and Nutch is extended through secondary development, that is, its source code is modified to support Chinese word segmentation. Second, index-time deduplication algorithms are studied and improved. Existing algorithms are analyzed, and a longest-paragraph signature algorithm is applied to pages that are completely or partially duplicated. The whole document is signed first; the filtered document is then split into paragraphs, the paragraphs are sorted, and the first N paragraphs are fingerprinted to serve as the document's features.
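The longest-paragraph signature scheme described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the thesis implementation: the MD5 hash, the newline-based paragraph split, sorting paragraphs by length, and the function names `fingerprint`, `signature`, and `is_duplicate` are all hypothetical choices, and the default values of `n` and `threshold` are arbitrary.

```python
import hashlib


def fingerprint(text: str) -> str:
    """MD5 hex digest of the text with all whitespace removed.

    The hash function is an assumption; the abstract does not name one.
    """
    normalized = "".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def signature(doc: str, n: int = 3) -> tuple[str, list[str]]:
    """Return (whole-document fingerprint, sorted fingerprints of the n longest paragraphs)."""
    # Assumption: paragraphs are separated by newlines in the extracted body text.
    paragraphs = [p.strip() for p in doc.split("\n") if p.strip()]
    # Assumption: "sorted" means longest first, as the algorithm's name suggests.
    longest = sorted(paragraphs, key=len, reverse=True)[:n]
    return fingerprint(doc), sorted(fingerprint(p) for p in longest)


def is_duplicate(a: str, b: str, n: int = 3, threshold: int = 1) -> bool:
    """Treat two documents as duplicates if their whole-document signatures
    match, or if they share more than `threshold` paragraph fingerprints."""
    whole_a, paras_a = signature(a, n)
    whole_b, paras_b = signature(b, n)
    if whole_a == whole_b:
        return True  # completely duplicated page
    shared = len(set(paras_a) & set(paras_b))
    return shared > threshold  # partially duplicated page
```

Checking the whole-document signature first lets exact copies be detected with a single comparison, while the paragraph fingerprints catch reprints that only change a header or footer.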
When the number of identical fingerprints shared by two documents exceeds a system-defined threshold, the two documents are judged to be duplicates. Extracting only the first N paragraphs and sorting their fingerprints greatly reduces the computational cost. Experimental results show that the method deduplicates with high accuracy. Third, for reprinted pages that were slightly modified, an optimized Fourier transform algorithm is applied at real-time retrieval. The algorithm maps each word of a document to a numerical fingerprint, so that every document is represented as a discrete numerical sequence. Applying the Fourier transform to this sequence yields Fourier coefficients, and the similarity of two sequences can be roughly estimated by comparing the first several coefficients. Experimental results show that the optimized Fourier transform algorithm balances precision and recall when pages have been modified. The system is developed on Nutch: the index-time algorithm is implemented by modifying the Nutch source code, and the web page deduplication algorithm is implemented as a plug-in. The campus network search engine is designed and implemented on top of Nutch, and its development process and methods are explained in detail. Finally, the proposed deduplication strategy is evaluated experimentally, using campus web pages crawled by Nutch as the data set. The results show that combining the two algorithms improves both the accuracy of search results and the accuracy of deduplication, and that the campus network search engine runs effectively and stably.
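The Fourier-based comparison described in the abstract can be sketched in a few lines of Python. This is only a rough illustration of the general idea, not the optimized algorithm from the thesis: the CRC32 word mapping, the whitespace split (standing in for a Chinese word segmenter), the use of coefficient magnitudes, the Euclidean distance, and the names `word_sequence`, `leading_coefficients`, and `spectral_distance` are all assumptions made for the example.

```python
import cmath
import zlib


def word_sequence(text: str) -> list[float]:
    """Map each word to a number in [0, 1) using CRC32 (an assumed mapping).

    Whitespace splitting stands in for a real word segmenter here.
    """
    return [zlib.crc32(w.encode("utf-8")) / 2**32 for w in text.split()]


def leading_coefficients(seq: list[float], k: int = 8) -> list[float]:
    """Magnitudes of the first k DFT coefficients, divided by the sequence length."""
    n = len(seq)
    if n == 0:
        return [0.0] * k
    coeffs = []
    for freq in range(k):
        # Direct evaluation of one DFT term; only k terms are needed, so no FFT.
        total = sum(
            x * cmath.exp(-2j * cmath.pi * freq * t / n)
            for t, x in enumerate(seq)
        )
        coeffs.append(abs(total) / n)
    return coeffs


def spectral_distance(a: str, b: str, k: int = 8) -> float:
    """Euclidean distance between the leading coefficient magnitudes of two texts."""
    ca = leading_coefficients(word_sequence(a), k)
    cb = leading_coefficients(word_sequence(b), k)
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
```

A caller would compare `spectral_distance(page_a, page_b)` against a tuned cutoff, with smaller values indicating more similar pages. Dividing by the sequence length keeps documents of different lengths on a comparable scale, and evaluating only the first `k` terms keeps the cost at O(k·n) per document.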
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3