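Research on Duplicate Web Page Detection Algorithms in Search Engines

Published: 2018-05-12 01:15

Topic: search engine + duplicate web page detection; Source: Henan University of Technology, master's thesis, 2012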

[Abstract]: With the spread and rapid development of the Internet, online information is growing exponentially, and search engines have become an effective tool for finding relevant information among massive web resources. However, because there is no clear, unified standard for publishing information online and publishing is easy, the Internet contains a large number of web pages with duplicate or near-duplicate content. These pages harm search engines in several ways: they degrade the user experience, waste crawling and storage resources, enlarge the inverted index, and reduce retrieval efficiency. Duplicate web page detection can therefore effectively improve search engine quality. In recent years, search engine companies and researchers in China and abroad have proposed many duplicate web page detection algorithms, including signature-based algorithms, the I-Match algorithm, feature-term-based algorithms, and the DSC algorithm. A detailed analysis of these algorithms shows that they share a common idea: first extract certain information from the text, then use the extracted information to judge similarity. The algorithms differ in how they extract text information, which in turn leads to different similarity computations. Some algorithms also compress the extracted information to improve computational efficiency. Whether the information extracted from the text represents it accurately is therefore a key factor in the performance of duplicate web page detection. This thesis analyzes two classic duplicate web page detection algorithms in detail and improves on their shortcomings.
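The shared pattern described above (extract information, optionally compress it, then compare) can be sketched roughly as follows. This is an illustrative signature-style example only, not the thesis's implementation and not a faithful version of I-Match or any other named algorithm; the stop list `STOP_WORDS` and the function names are assumptions made for the sketch.

```python
import hashlib
import re

# Hypothetical stop list (an assumption); a real system would more likely
# derive its term filter from collection statistics.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def extract_terms(text):
    """Step 1: extract information, here the set of distinct non-stop-word terms."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def fingerprint(terms):
    """Optional compression: hash the sorted term set into one fixed-size digest."""
    joined = " ".join(sorted(terms))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def is_duplicate(text_a, text_b):
    """Step 2: similarity decision, here simply 'identical fingerprints'."""
    return fingerprint(extract_terms(text_a)) == fingerprint(extract_terms(text_b))
```

Because the comparison is an exact match on a single digest, this sketch only catches pages whose filtered term sets are identical; the shingle-based and vector-based approaches trade that simplicity for a graded similarity score.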
The main contents are as follows:

(1) Improvement of the DSC duplicate web page detection algorithm. DSC (Digital Syntactic Clustering) is a classic algorithm for duplicate web page detection. Its basic idea is to split the text into a number of shingles and then select some of those shingles for similarity comparison. Its shortcoming is that the shingles are selected at random, so the content features of the text are not fully used. To address this, the improved algorithm maintains a set of feature terms and selects only shingles that contain a feature term, so that the shingles used in the comparison make better use of the structural and content features of the text.

(2) Improvement of the feature-term-based duplicate web page detection algorithm. This algorithm first uses the TFIDF algorithm from traditional information retrieval to extract the feature terms of a text, represents the text as a vector in the feature-term space, and then judges similarity with the cosine formula. The shortcoming of TFIDF is that it ignores the position of a feature term within the text when computing the term's weight. Observation of web pages shows that web page text is usually short, that most pages have a title, and that the title is a concise summary of the content. Based on this, the TFIDF algorithm is improved by increasing the weight of feature terms that appear in the title.

(3) Performance evaluation of the improved algorithms. A prototype search engine system based on the open-source indexing and retrieval tool Lucene was implemented to verify the performance of the improved algorithms. The experimental results show that the improved algorithms achieve higher recall and precision in duplicate web page identification than the original algorithms.
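The two improvements can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the abstract gives neither the shingle width, the resemblance measure, nor the exact weighting formula, so the width `w=3`, the Jaccard-style `resemblance`, the smoothed IDF, and the multiplicative `title_boost` are all hypothetical choices, as are the function names.

```python
import math
from collections import Counter


def select_shingles(words, feature_terms, w=3):
    """Keep only the w-word shingles that contain at least one feature term."""
    all_shingles = [tuple(words[i:i + w]) for i in range(len(words) - w + 1)]
    return {s for s in all_shingles if any(t in feature_terms for t in s)}


def resemblance(shingles_a, shingles_b):
    """Jaccard overlap of two shingle sets (assumed measure); 0.0 if both are empty."""
    union = shingles_a | shingles_b
    if not union:
        return 0.0
    return len(shingles_a & shingles_b) / len(union)


def weighted_vector(title_words, body_words, doc_freq, n_docs, title_boost=2.0):
    """TF-IDF weights, multiplied by title_boost for terms that occur in the title.

    doc_freq maps a term to the number of documents containing it.
    The smoothed IDF and the boost factor are assumptions for this sketch.
    """
    tf = Counter(title_words + body_words)
    title_terms = set(title_words)
    vector = {}
    for term, count in tf.items():
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        weight = count * idf
        if term in title_terms:
            weight *= title_boost  # position-aware adjustment
        vector[term] = weight
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In use, two pages would be flagged as duplicates when `resemblance` or `cosine` exceeds a chosen threshold; the abstract does not state the threshold, so it would have to be tuned on labeled data.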
[Degree-granting institution]: Henan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3

