

Research and Implementation of Web Page Deduplication in a Search Engine System

Published: 2018-10-30 13:37
【Abstract】: With the rapid development of computer hardware, software, and Internet technology, the amount of information on the Web has grown explosively, making it the largest, most varied, and most comprehensive collection of information resources in human history. When users look for information on the Internet, however, they usually know only their search keywords rather than specific URLs, so they rely on search engines to find what they need.

Search engines make it convenient for users to locate information on the Internet and save them time, and they are therefore widely welcomed. Many powerful search engines have emerged, such as Baidu for Chinese and Google for multiple languages. However, some websites, driven by commercial interests, reprint large numbers of articles from other sites to raise their click-through rates, and good articles are also reposted across blogs and forums. When a hot event or a topic of broad public interest appears, many sites compete to report and reprint it, so the results a search engine returns contain many pages with different links but identical content, which degrades the user experience. Users have to sift through large numbers of identical results to find the information they need, and the duplicate pages also inflate the storage required by the index database.

Removing duplicate web pages is one way to improve the practicality and efficiency of a search engine. This thesis first implements extraction of the topic content of a web page on the basis of the maximum text block algorithm based on HTML tags. On that foundation, it proposes a page deduplication algorithm based on keywords and feature codes (signatures), develops an experimental system to validate the algorithm, and demonstrates the algorithm's effectiveness through analysis and discussion of the experimental results.

The main work of this thesis is as follows:
1. Theoretical study: the operating principles and key technologies of search engines are analyzed, along with several classical deduplication algorithms spanning the range from text similarity detection to web page similarity detection.
2. Topic content extraction: deduplicating web pages is not quite the same as deduplicating plain text; the topic content of a page must first be extracted by removing page noise such as navigation bars, advertisements, and copyright notices. Building on the maximum text block algorithm based on HTML tags and taking various page types into account, an extraction algorithm is designed and implemented (a simplified sketch follows this abstract).
3. Algorithm improvement: on the basis of the extracted topic content, three classical web page deduplication approaches (feature codes, feature sentences, and the KCC algorithm) are examined, and, drawing on their strengths, a deduplication algorithm based on keywords and feature codes is proposed. The algorithm is simple and efficient, can effectively recognize pages that were slightly modified during reprinting, and improves deduplication accuracy (a hedged sketch of this combination also follows the abstract).
4. Design and implementation: a simple single-machine search engine system is implemented on top of the open source framework Lucene, with the keyword-and-feature-code algorithm embedded in its deduplication module. The system can crawl pages as needed, deduplicate them, index the deduplicated pages, and return relevant results for a user's query keywords (an indexing sketch follows the abstract as well).
5. Experimental analysis: the proposed deduplication algorithm is embedded in the search engine system and applied to a crawled data set of 900 web pages containing duplicates; analysis of the experimental results demonstrates the effectiveness of the improved algorithm.
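
The HTML-tag-based maximum text block step in item 2 can be illustrated with a minimal sketch. The Java code below is a simplified, assumption-laden variant rather than the thesis' actual extractor: it strips scripts, styles, and comments, splits the page on common container tags, and keeps the block that carries the most plain text. The class and method names are illustrative.

```java
// Minimal sketch of a "maximum text block" style extractor (illustrative names).
public class MaxTextBlockExtractor {

    // Remove page noise the thesis mentions: scripts, styles, and HTML comments.
    private static String stripNoise(String html) {
        return html.replaceAll("(?is)<script.*?</script>", " ")
                   .replaceAll("(?is)<style.*?</style>", " ")
                   .replaceAll("(?is)<!--.*?-->", " ");
    }

    // Split the page into candidate blocks on common container tags and
    // return the block that carries the most plain text.
    public static String extractMainContent(String html) {
        String[] blocks = stripNoise(html)
                .split("(?i)</?(div|td|table|p|section|article)[^>]*>");
        String best = "";
        for (String block : blocks) {
            String text = block.replaceAll("<[^>]+>", " ")  // drop remaining tags
                               .replaceAll("\\s+", " ")
                               .trim();
            if (text.length() > best.length()) {
                best = text;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String html = "<html><body><div>home | news | contact</div>"
                + "<div>This is the main article body of the page, clearly longer than the navigation or the footer.</div>"
                + "<div>copyright 2011</div></body></html>";
        System.out.println(extractMainContent(html));  // prints the middle block
    }
}
```

A real extractor would also weigh the tag structure and handle pages whose main content is spread over several blocks, which is what "taking various page types into account" in the abstract refers to.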
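Item 3 pairs keywords with a feature code. The sketch below is one plausible reading of that pairing, not the thesis' exact algorithm: the most frequent terms of the extracted content act as keywords, a hash of the sorted keyword set acts as the feature code, and two pages count as duplicates when the codes match or the keyword sets overlap strongly. The value of k, the whitespace tokenisation, and the 0.8 threshold are assumptions for illustration.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hedged sketch: combines a keyword set with a coarse feature code (signature).
// The tokenisation, k, and the similarity threshold are illustrative choices.
public class KeywordSignatureDedup {

    // Pick the k most frequent terms of the extracted main content as "keywords".
    static Set<String> topKeywords(String text, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.length() > 1) {
                freq.merge(term, 1, Integer::sum);
            }
        }
        return freq.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // A simple "feature code": hash of the sorted keyword set.
    static long featureCode(Set<String> keywords) {
        List<String> sorted = new ArrayList<>(keywords);
        Collections.sort(sorted);
        return String.join("|", sorted).hashCode();
    }

    // Pages are duplicates if their feature codes match or their keyword sets
    // overlap strongly (which tolerates minor edits made while reprinting).
    static boolean isDuplicate(String pageA, String pageB) {
        Set<String> ka = topKeywords(pageA, 10);
        Set<String> kb = topKeywords(pageB, 10);
        if (featureCode(ka) == featureCode(kb)) return true;
        Set<String> inter = new HashSet<>(ka);
        inter.retainAll(kb);
        Set<String> union = new HashSet<>(ka);
        union.addAll(kb);
        double jaccard = union.isEmpty() ? 0 : (double) inter.size() / union.size();
        return jaccard > 0.8;
    }

    public static void main(String[] args) {
        String a = "search engine duplicate page removal improves user experience";
        String b = "search engine duplicate page removal greatly improves user experience";
        System.out.println(isDuplicate(a, b));  // true: the reprint shares almost all keywords
    }
}
```

Because only the extracted topic content feeds the comparison, reprints that differ merely in ads or navigation still collide, which matches the behaviour the abstract describes for slightly modified reprints.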
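Item 4 embeds the deduplication check in front of Lucene indexing. The sketch below shows that "fingerprint, skip if seen, otherwise index" flow against a recent Lucene release; the thesis itself targets a 2011-era Lucene whose API differs, and the DedupIndexer class, the field names, and the hash-based fingerprint placeholder are assumptions standing in for the keyword-and-feature-code module.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch of the "dedup before index" step on top of Lucene. The fingerprint
// check stands in for the thesis' keyword-and-feature-code module; only pages
// whose fingerprint has not been seen are handed to the IndexWriter.
public class DedupIndexer {

    private final Set<Long> seenFingerprints = new HashSet<>();

    public void indexPages(List<String[]> pages) throws IOException {  // each entry: {url, mainContent}
        Directory dir = FSDirectory.open(Paths.get("dedup-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (String[] page : pages) {
                long fp = fingerprint(page[1]);
                if (!seenFingerprints.add(fp)) {
                    continue;                       // duplicate page: skip indexing
                }
                Document doc = new Document();
                doc.add(new StringField("url", page[0], Field.Store.YES));
                doc.add(new TextField("content", page[1], Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

    // Placeholder fingerprint; in the thesis this role is played by the
    // keyword-and-feature-code comparison of the deduplication module.
    private long fingerprint(String mainContent) {
        return mainContent.replaceAll("\\s+", " ").trim().hashCode();
    }

    public static void main(String[] args) throws IOException {
        DedupIndexer indexer = new DedupIndexer();
        indexer.indexPages(Arrays.asList(
                new String[]{"http://a.example/1", "duplicate removal keeps the index small"},
                new String[]{"http://b.example/1", "duplicate removal keeps the index small"}));
        // Only the first page is indexed; the second has the same fingerprint.
    }
}
```

Keeping the seen fingerprints in an in-memory set is adequate for a collection of a few hundred pages such as the 900-page experiment; a larger crawl would persist them alongside the index.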
【Degree-granting institution】: Henan University
【Degree level】: Master's
【Year of degree conferral】: 2011
【CLC number】: TP393.092

【Cited by】

Related journal articles (1 item)

1. 程們森, 安俊秀. A feature-word-group based algorithm for identifying duplicate and near-duplicate news web pages [J]. Journal of Chengdu University of Information Technology, 2012(4).

Related master's theses (1 item)

1. 張芳. Research on web page deduplication techniques in a campus network search engine [D]. Inner Mongolia University of Science and Technology, 2012.


Article ID: 2300151


Link to this article: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2300151.html


