Research on Duplicate Web Page Detection Algorithms in Search Engines

Published: 2018-05-12 01:15

Topics: search engine + duplicate web page detection. Source: Henan University of Technology, master's thesis, 2012


[Abstract]: With the spread and rapid development of the Internet, online information is growing exponentially, and search engines have become an effective tool for users to find the information they need among massive web resources. However, because there is no clear, unified standard for publishing information online, and publishing is easy, the Internet contains a large number of web pages with duplicate or near-duplicate content. These duplicate pages cause many problems for search engines: they degrade the user experience, waste crawling and storage resources, enlarge the inverted index, and reduce retrieval efficiency. Duplicate web page detection can therefore effectively improve search engine quality.

In recent years, major search engine companies and researchers in China and abroad have proposed many duplicate web page detection algorithms, such as signature-based algorithms, the I-Match algorithm, feature-term-based duplicate page detection, and the DSC algorithm. A detailed analysis of the existing algorithms shows that they share a common idea: first extract certain information from the text, then use the extracted information to judge similarity. The algorithms differ in how they extract information from the text, which in turn leads to different ways of computing similarity. Some algorithms also compress the extracted information to make the computation more efficient. Whether the information extracted from the text represents it accurately is thus the key factor in the performance of duplicate web page detection.

The thesis analyzes two classic duplicate web page detection algorithms in detail and improves on their shortcomings. The main research contents are as follows:

(1) Improvement of the DSC duplicate web page detection algorithm. DSC (Digital Syntactic Clustering) is a classic algorithm for duplicate web page detection. Its basic idea is to split the text into a number of shingles and then select some of those shingles to take part in the similarity comparison. Its drawback is that the shingles are selected at random, so the content features of the text are not fully used. To address this, the improved algorithm maintains a set of feature terms and selects only shingles that contain a feature term, so that the shingles being compared make better use of the structural and content features of the text.

(2) Improvement of the feature-term-based duplicate web page detection algorithm. This algorithm first uses the TFIDF algorithm from traditional information retrieval to extract the feature terms of a text, represents the text as a vector in the feature-term space, and then judges similarity with the cosine formula. The drawback of TFIDF is that it ignores where a feature term appears in the text when computing its weight. Observation of web pages shows that page text is usually short, that most pages have a title, and that the title is a concise summary of the content. Based on this, the TFIDF algorithm is improved by increasing the weight of feature terms that appear in the title.

(3) Performance evaluation of the improved algorithms. A search engine prototype built on the open-source indexing and retrieval tool Lucene was implemented to verify the improved algorithms. The experimental results show that the improved algorithms achieve higher recall and precision in duplicate page identification than the original algorithms.
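The shingle selection described in item (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the shingle width, the Jaccard resemblance measure, the FEATURE_TERMS set, and all function names are hypothetical, and the regex tokenizer only suits whitespace-delimited text (Chinese pages would need a word segmenter first).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def select_shingles(all_shingles, feature_terms):
    """Keep only the shingles that contain at least one feature term."""
    return {s for s in all_shingles if feature_terms.intersection(s)}


def resemblance(a, b):
    """Jaccard resemblance |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature-term set; the abstract does not say how it is built.
FEATURE_TERMS = {"search", "engine", "duplicate", "detection"}

doc_a = "duplicate page detection improves search engine quality for users"
doc_b = "duplicate page detection improves search engine quality for everyone"

all_a, all_b = shingles(tokenize(doc_a)), shingles(tokenize(doc_b))
score_all = resemblance(all_a, all_b)
score_selected = resemblance(
    select_shingles(all_a, FEATURE_TERMS),
    select_shingles(all_b, FEATURE_TERMS),
)
```

In this toy pair the two texts differ only in a shingle that contains no feature term, so the selected-shingle score ignores the difference while the all-shingle score does not.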
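The title weighting described in item (2) might look like the sketch below. The abstract gives neither the TFIDF variant nor the size of the boost, so the smoothed IDF formula, the title_boost parameter, the toy corpus statistics, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vector(title_tokens, body_tokens, doc_freq, n_docs, title_boost=2.0):
    """Build a sparse TF-IDF vector, boosting terms that occur in the title."""
    tf = Counter(title_tokens) + Counter(body_tokens)
    title_terms = set(title_tokens)
    vec = {}
    for term, count in tf.items():
        # Smoothed IDF (an assumption; the thesis formula is not given).
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        weight = count * idf
        if term in title_terms:
            weight *= title_boost  # hypothetical boost factor
        vec[term] = weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy statistics: document frequency per term over 3 pages.
DOC_FREQ = {"lucene": 1, "index": 2, "search": 3, "tutorial": 1}
N_DOCS = 3

page_a = tfidf_vector(["lucene", "tutorial"], ["index", "search", "lucene"], DOC_FREQ, N_DOCS)
page_b = tfidf_vector(["lucene", "tutorial"], ["search", "index"], DOC_FREQ, N_DOCS)
similarity = cosine(page_a, page_b)
```

A pair of pages would then be flagged as duplicates when the similarity exceeds a chosen threshold; the threshold is a tuning choice that the abstract does not specify.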
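The recall and precision mentioned in item (3) can be computed over sets of duplicate page pairs, as in this small sketch. The page IDs and the pair-based formulation are hypothetical; the abstract does not describe the test collection or how the duplicate judgments were obtained.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for sets of predicted and true duplicate pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page IDs; frozenset makes (a, b) and (b, a) the same pair.
actual = {frozenset(p) for p in [("p1", "p2"), ("p3", "p4"), ("p5", "p6")]}
predicted = {frozenset(p) for p in [("p2", "p1"), ("p3", "p4"), ("p7", "p8")]}

precision, recall = precision_recall(predicted, actual)
```

Here two of the three predicted pairs are true duplicates and two of the three true pairs are found, so both measures come out at two thirds.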
[Degree-granting institution]: Henan University of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3

Article ID: 1876461


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1876461.html

