Research and Implementation of a Web Crawler System Supporting Ajax Dynamic Web Pages Using Nutch

Published: 2018-10-22 12:28

[Abstract]: With the rapid development of Web 2.0, websites increasingly rely on Ajax. By issuing asynchronous calls and refreshing only part of a page, Ajax substantially improves the user experience, reduces network traffic, and speeds up site access. While Ajax has changed the interaction model of the Internet, it has also brought a series of problems for users and developers: JavaScript code that is used and written without consistent standards, browser incompatibilities, excessive numbers of page requests, and overloaded servers caused by misuse of Ajax. A crawler is an essential data collection subsystem of a search engine: the search engine builds an index from the data the crawler collects and then provides search services to users. The widespread use of Ajax therefore has an important impact on search engines as well. Traditional search engines only provide search over the data in static pages and cannot provide search over dynamic data generated by Ajax. As Ajax use has grown, the volume of dynamically generated page data has grown with it, and this data is valuable for data analysis and data mining. For example, some of the comments on Sina News are generated dynamically through Ajax, and collecting this data is significant for national security. This thesis improves Nutch by adding several modules to build a web crawler system that can crawl Ajax dynamic data, builds an index from the collected data, and provides search services to users.
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3

Article ID: 2287161

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2287161.html
