Research and Implementation of an Addressed Web Crawler System Supporting AJAX

Published: 2018-01-27 16:32

Keywords: AJAX; JavaScript; web crawler; data collection; addressed crawling. Source: Beijing University of Posts and Telecommunications, master's thesis, 2013. Document type: degree thesis.

[Abstract]: After the concept of Web 2.0 emerged, a class of web applications known as RIAs appeared, offering high interactivity and a rich user experience; blogs and microblogs are examples. Because AJAX meets the needs of the Web 2.0 era, it is used more and more widely in web development. AJAX uses client-side JavaScript to modify the DOM structure dynamically, so that pages are rebuilt seamlessly, which improves their interactivity, speed, and usability. At the same time, it changes the traditional web application model and breaks the crawling mode in which traditional crawlers rely on analyzing the hyperlinks in a page. Traditional crawlers therefore cannot collect the dynamic content of AJAX pages, which means that a large amount of meaningful data cannot be retrieved through search engines.

To solve the problem of collecting dynamic data from AJAX websites, this thesis designs and implements an addressed web crawler system that supports AJAX. First, based on a study of traditional web crawlers, it identifies the technical difficulties of AJAX crawling and, starting from a real AJAX website, describes the key problems traditional crawlers face on sites built with AJAX, together with the application scenarios of the research. Second, it introduces the relevant concepts and the problem model, and designs the system workflow and architecture. Finally, through analysis and design of the key problems in AJAX crawling, it implements the addressed web crawler system.

The system separates two functions of the traditional crawler workflow, URL extraction and page download, into two independent modules. The URL extraction module extracts a site's URLs and builds a URL repository. A browser built on the WebKit rendering engine loads HTML pages and interprets their JavaScript. Combined with JavaScript page-turning scripts produced by a script generator, the system identifies the page elements used for navigation from the page's DOM representation, automatically triggers the events on those elements, and generates and extracts the paginated content. The crawler collects only the pages that the addresses in the URL repository point to. In other words, the crawl scope is bounded entirely by the URL repository and is therefore controlled; this is what "addressed" means.

The system's recall, accuracy, and performance were tested on six real websites in three categories. The experimental results show that recall reached 100% and that, when crawling without page turning, the average crawl rate reached 52.03 kb/s, indicating good performance. The study shows that the system can accurately capture the dynamic content of AJAX websites and collect paginated data from target pages with similar page structure. The system is flexible and widely applicable, and can be used for building vertical search, open-source intelligence collection, and similar tasks.
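The separation of URL extraction from page download described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `UrlRepository`, `extract_urls`, `make_paging_script`, and `crawl` are hypothetical, the `fetch` callable stands in for the WebKit-based browser that would load and render each page, and the CSS selector passed to `make_paging_script` is an assumed way of pointing at the navigation element.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List
from urllib.parse import urldefrag


class UrlRepository:
    """Hypothetical URL repository: the only source of crawl targets."""

    def __init__(self) -> None:
        self._urls: List[str] = []

    def add(self, url: str) -> bool:
        # Drop the #fragment so one page is not stored under two addresses.
        clean = urldefrag(url.strip())[0]
        if not clean or clean in self._urls:
            return False
        self._urls.append(clean)
        return True

    def __iter__(self) -> Iterator[str]:
        # Iterate over a copy so callers cannot mutate the repository.
        return iter(list(self._urls))


def extract_urls(candidates: Iterable[str], repo: UrlRepository) -> int:
    """URL extraction module: fills the repository and downloads nothing."""
    return sum(repo.add(url) for url in candidates)


def make_paging_script(selector: str) -> str:
    """Build JavaScript that clicks the first element matching a CSS selector.

    The selector is an assumption standing in for however the script
    generator identifies the navigation element.
    """
    return (
        "(function () {"
        " var el = document.querySelector(" + json.dumps(selector) + ");"
        " if (!el) { return false; }"
        " el.click();"
        " return true; })();"
    )


def crawl(repo: UrlRepository, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Download module: visits repository URLs only.

    Links found inside fetched pages are deliberately not followed, so the
    crawl scope stays bounded by the repository.
    """
    return {url: fetch(url) for url in repo}


if __name__ == "__main__":
    repo = UrlRepository()
    extract_urls(
        ["http://example.com/a", "http://example.com/a#top", "http://example.com/b"],
        repo,
    )
    # Placeholder fetcher; a real one would render the page in a browser engine.
    pages = crawl(repo, lambda url: "<html>rendered " + url + "</html>")
    print(sorted(pages))
    print(make_paging_script("a.next"))
```

Passing `fetch` in as a parameter keeps the download module independent of any particular rendering engine, and because `crawl` never inspects the fetched HTML for links, the set of visited pages cannot grow beyond what the repository contains.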
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3



Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1468905.html
