
Design and Implementation of a Dynamic Adaptive Resource Collection System

Published: 2018-08-24 15:24
[Abstract]: The Internet offers an ever-growing amount of valuable information, and people have become used to obtaining it through search engines. The total number of web pages in China grew by nearly 41% from 2011 to 2012, which places higher demands on how search engines collect web resources. The number of web pages is enormous, and the number of dynamic pages in particular is growing rapidly. During resource collection, various abnormal conditions are unavoidable, such as slow server responses, excessive duplicate pages and invalid links, and link relationships between web resources that are hard to discover. This thesis focuses on solutions to these problems.

The main goal is to design and implement a resource collection system that not only adjusts dynamically and adapts automatically to the abnormal conditions found on a wide-area network, but also discovers link relationships between pages from the information already collected and predicts more similar pages. The system uses real-time statistics gathered during collection as the basis for filtering links in real time, aiming to filter out links with a high duplication rate, links that are invalid, and links that time out, thereby improving collection efficiency. Compared with a typical collection system, this system adapts better to unstable network conditions and handles large numbers of junk links better. For links that are hard to discover, the thesis proposes a link analysis and prediction method that makes predictions from an analysis of link statistics. The method discovers a large number of similar pages and extends collection coverage, compensating for the shortcomings of conventional link extraction.

The system is implemented with a distributed architecture. In addition to the basic modules (page download, page parsing, URL deduplication, and URL scheduling), it adds a real-time filtering module and a URL prediction module, along with auxiliary modules for statistics, URL clustering, and classification, which give the system its dynamic, adaptive character.

Tests show that the proposed method recognizes abnormal collection conditions as they occur and adjusts adaptively, which improves robustness and keeps the collection process stable. The system can effectively predict hard-to-discover links, so the thesis provides another effective way of discovering links besides conventional extraction.
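The real-time filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the statistics are keyed or which thresholds are used, so everything below is an assumption made for illustration: statistics are kept per host, the class names `HostStats` and `RealTimeFilter` are hypothetical, and the threshold values are arbitrary. It is not the thesis's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class HostStats:
    """Running counters for one host, updated after every fetch attempt."""
    fetched: int = 0
    duplicates: int = 0
    invalid: int = 0
    timeouts: int = 0

    def rate(self, count: int) -> float:
        """Share of fetch attempts that ended in the given outcome."""
        return count / self.fetched if self.fetched else 0.0


class RealTimeFilter:
    """Drops links from hosts whose recent crawl statistics look bad.

    All thresholds and the per-host granularity are illustrative
    assumptions, not values taken from the thesis.
    """

    OUTCOMES = {"ok", "duplicate", "invalid", "timeout"}

    def __init__(self, max_dup=0.8, max_invalid=0.5, max_timeout=0.5,
                 min_samples=20):
        self.max_dup = max_dup
        self.max_invalid = max_invalid
        self.max_timeout = max_timeout
        self.min_samples = min_samples
        self.stats = defaultdict(HostStats)

    def record(self, url: str, outcome: str) -> None:
        """Update the statistics for the URL's host after a fetch attempt."""
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        s = self.stats[urlsplit(url).netloc]
        s.fetched += 1
        if outcome == "duplicate":
            s.duplicates += 1
        elif outcome == "invalid":
            s.invalid += 1
        elif outcome == "timeout":
            s.timeouts += 1

    def allow(self, url: str) -> bool:
        """Return True if the URL should still be scheduled for download."""
        s = self.stats.get(urlsplit(url).netloc)
        # Too little evidence about this host: let the link through.
        if s is None or s.fetched < self.min_samples:
            return True
        return (s.rate(s.duplicates) <= self.max_dup
                and s.rate(s.invalid) <= self.max_invalid
                and s.rate(s.timeouts) <= self.max_timeout)
```

In a crawler loop, the downloader would call `record` after each attempt and the scheduler would call `allow` before queuing a newly extracted link, so the filter adapts as conditions on each host change.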
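[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092; TP391.3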




Article ID: 2201235


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2201235.html

