Research and Implementation of a Distributed Crawler System Supporting AJAX

Published: 2018-05-07 16:27

Topic: distributed crawler + AJAX. Source: Huazhong University of Science and Technology, 2013 master's thesis

[Abstract]: Internet technology is evolving rapidly, new Internet products keep appearing, and AJAX is increasingly popular with developers. AJAX, however, is unfriendly to traditional web crawlers: content fetched with conventional page-crawling methods is incomplete. Studying a crawler system that supports AJAX therefore has practical significance. This thesis first surveys research in China and abroad on fetching asynchronously loaded pages, explains why such pages are hard to index, analyzes the strengths and weaknesses of current crawling approaches, and proposes a scheme that requests and retrieves pages by calling a browser interface. Second, to improve crawling efficiency and coordinate resource allocation between the AJAX crawler and the static-page crawler, it proposes a page attribute classifier. The classifier takes the body-text extraction results from the page processing module as feedback to correct its classifications, and each page is crawled with the method that matches its class. Finally, to keep the distributed system running healthily, the system includes a heartbeat monitoring module that collects heartbeat information from the distributed system and computes statistics on system health. The resulting system can index both asynchronously loaded pages and ordinary static pages and distributes crawling tasks efficiently, offering a new approach to crawling asynchronously loaded pages. System tests show that the expected functionality was achieved with good performance.
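
The abstract describes a classifier that routes each page to either the static crawler or the browser-based AJAX crawler and corrects itself using body-text extraction results. It does not give the classifier's features or algorithm, so the following is only a minimal sketch under stated assumptions: the class name `PageTypeClassifier`, the per-host score, and the parameters `min_text_chars`, `threshold`, and `step` are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from urllib.parse import urlparse


class PageTypeClassifier:
    """Hypothetical router between a static fetcher and a browser fetcher.

    Assumption (not from the thesis): a per-host score in [0, 1] is nudged
    by feedback from body-text extraction; a high score means pages on that
    host probably load their content asynchronously.
    """

    def __init__(self, min_text_chars=200, threshold=0.5, step=0.25):
        self.min_text_chars = min_text_chars  # below this, extraction "failed"
        self.threshold = threshold            # score at which we use a browser
        self.step = step                      # size of each feedback correction
        self.ajax_score = defaultdict(float)  # host -> score, starts at 0.0

    def choose_fetcher(self, url):
        """Return "browser" or "static" for this URL's host."""
        host = urlparse(url).netloc
        return "browser" if self.ajax_score[host] >= self.threshold else "static"

    def feedback(self, url, fetcher, extracted_text):
        """Correct the host's score using the body-text extraction result."""
        if fetcher != "static":
            return  # this sketch only learns from static fetches
        host = urlparse(url).netloc
        too_short = len(extracted_text.strip()) < self.min_text_chars
        if too_short:
            # Little body text from a static fetch: suspect async loading.
            self.ajax_score[host] = min(1.0, self.ajax_score[host] + self.step)
        else:
            # Static fetch was enough: move back toward the cheap path.
            self.ajax_score[host] = max(0.0, self.ajax_score[host] - self.step)
```

With these default values, two consecutive static fetches that yield almost no body text switch a host to the browser path, and one successful static extraction switches it back. A real system would likely use richer page features than extracted text length alone.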
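[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.1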


Article ID: 1857629


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1857629.html
