Research and Implementation of a Fixed-Address Web Crawler System Supporting AJAX
Keywords: AJAX, JavaScript, web crawler, data collection, fixed-address. Source: Beijing University of Posts and Telecommunications, 2013 master's thesis. Document type: degree thesis
[Abstract]: After the concept of Web 2.0 emerged, a class of web applications known as RIAs (rich Internet applications) appeared, offering high interactivity and a rich user experience; blogs and microblogs are examples. Because AJAX meets the needs of the Web 2.0 era, it is used more and more in web development. AJAX uses client-side JavaScript to modify the DOM dynamically, which lets a page be rebuilt seamlessly and improves its interactivity, speed, and usability. At the same time, it changes the traditional web application model and breaks the crawling model of traditional crawlers, which depends on analyzing the hyperlinks in a page. As a result, traditional crawlers cannot collect the dynamic content of AJAX pages, which means that a large amount of meaningful data cannot be retrieved through search engines.

To solve the problem of collecting dynamic data from AJAX websites, this thesis designs and implements a fixed-address web crawler system that supports AJAX. First, based on a study of traditional web crawlers, it identifies the technical difficulties of AJAX crawling and, starting from a real AJAX website, describes the key problems that traditional crawlers face when crawling sites built with AJAX, together with the application scenarios of the research. Second, it introduces the relevant concepts and the problem model, and designs the system's workflow and architecture. Finally, through analysis and design of the key problems in AJAX crawling, it implements the system.

The system separates the two functions of a traditional crawler's workflow, URL extraction and page downloading, into two independent modules. The URL extraction module extracts a site's URLs to form a URL repository. A browser built on the WebKit rendering engine loads HTML pages and interprets their JavaScript. Combined with JavaScript page-turning scripts produced by a script generator, it identifies the page elements used for navigation in the page's DOM representation, automatically triggers the events on those elements, and generates and extracts the paginated content. The crawler collects only the pages that the link addresses in the URL repository point to. In other words, its crawl scope is entirely bounded by the URL repository and is therefore controlled, which is the sense in which it is a "fixed-address" crawler.

In addition, three categories of real websites (six sites in total) were used to test the system's recall, accuracy, and performance. The experimental results show that the system's recall reached 100%, and that without page turning the average crawl rate reached 52.03 kb/s.

The research shows that the system can accurately crawl the dynamic content of AJAX websites and collect paginated data from target pages with similar page structures. The system is flexible and broadly applicable, and can be used to build vertical search services and to collect open-source intelligence.
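The two-module design described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `UrlRepository`, `Page`, `build_paging_script`, and `crawl` are hypothetical names, the `Page` interface stands in for the WebKit-based browser, and the caller is assumed to supply a CSS selector for the "next page" element.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class UrlRepository:
    """Hypothetical URL repository, filled by a separate URL extraction step."""
    urls: list[str] = field(default_factory=list)

    def add(self, url: str) -> None:
        url = url.strip()
        if url and url not in self.urls:  # keep insertion order, skip duplicates
            self.urls.append(url)


class Page(Protocol):
    """Assumed interface of a browser page that executes JavaScript."""

    def html(self) -> str:
        """Return the serialized DOM in its current state."""
        ...

    def run_js(self, script: str) -> bool:
        """Run a script in the page and return its boolean result."""
        ...


def build_paging_script(next_selector: str) -> str:
    """Generate JavaScript that clicks the navigation element, if one exists."""
    selector = json.dumps(next_selector)  # safely quoted JS string literal
    return (
        "(function () {"
        f" var el = document.querySelector({selector});"
        " if (!el) { return false; }"
        " el.click();"
        " return true;"
        " })();"
    )


def crawl(
    repo: UrlRepository,
    open_page: Callable[[str], Page],
    next_selector: str,
    max_pages: int = 5,
) -> dict[str, list[str]]:
    """Visit only repository URLs and collect up to max_pages DOM snapshots each."""
    script = build_paging_script(next_selector)
    results: dict[str, list[str]] = {}
    for url in repo.urls:  # crawl scope is exactly the repository
        page = open_page(url)
        snapshots = [page.html()]
        # A real browser would also need to wait for the AJAX response
        # before reading the DOM again; that step is omitted here.
        while len(snapshots) < max_pages and page.run_js(script):
            snapshots.append(page.html())
        results[url] = snapshots
    return results
```

Because `crawl` never follows links found in the downloaded pages, the set of visited addresses is fixed by whatever was placed in the repository beforehand, which mirrors the "fixed-address" property. The `max_pages` cap is an added safeguard against endless pagination and is not taken from the source.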
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year degree conferred】: 2013
【Classification number】: TP391.3