Research and Implementation of Distributed Web Crawler Technology Based on the Storm Cloud Platform

Published: 2018-01-06 21:06

Source: University of Electronic Science and Technology of China, master's thesis, 2015. Document type: degree thesis

Related topics: distributed, web crawler, Storm, Weibo

[Abstract]: With the rapid development of the Internet, many new business models, such as O2O, have moved online. More and more websites are being created as a result, and the volume of information on the Internet keeps growing. People want to find the information they need quickly in this vast space, so search engine technology is becoming ever more important. Web crawlers are a key component of search engines, and this growth poses new challenges for them. Traditional single-machine crawlers can no longer keep up with the demand for fetching rapidly growing data, which has led to the emergence of distributed web crawler technology. A distributed crawler divides the work among multiple cooperating machines, which raises crawl speed and improves the overall performance of the crawler. This thesis designs and implements a scalable distributed web crawler system based on Storm and, given the current popularity of the Sina Weibo platform, uses Sina Weibo as the crawler's data source. Specifically, the thesis completes the following work:
1. It analyzes the requirements of the distributed web crawler, covering four aspects: the goals of the system, its feasibility, its functional requirements, and its performance requirements. The functional requirements analysis divides the system into six modules: a simulated login module, a URL queue store module, a URL link optimization module, a page download module, a page parsing module, and a page storage module. The requirements of each module are described in detail.
2. It presents a detailed design of the crawler for Sina Weibo, including the database design and the system architecture design. The emphasis is on the overall system architecture, with a detailed description of the design of each of the six modules.
3. It tests the implemented distributed web crawler system in terms of both functionality and performance, and analyzes the test results.
4. It summarizes the thesis, analyzes its problems and shortcomings, and proposes directions for future research.
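The abstract names the six modules but does not say how they interact. As a rough single-process illustration only, the sketch below chains a URL queue, link normalization with de-duplication, download, parsing, and storage. Everything in it is an assumption made for illustration: the function names, the in-memory store, and the fake three-page site are hypothetical, the simulated login stage is folded into a caller-supplied `fetch` function, and no Storm API is used. In the thesis these stages run distributed on Storm, which this sketch does not attempt to reproduce.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Page parsing stage: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(base, href):
    """URL optimization stage (assumed): resolve relative links, drop fragments."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


def crawl(seed, fetch, max_pages=100):
    """Run queue -> download -> parse -> store until the queue is empty.

    `fetch` is a caller-supplied function (URL -> HTML text or None). It stands
    in for the login and download stages, which are not modeled here.
    """
    queue = deque([seed])  # URL queue stage
    seen = {seed}          # de-duplication set for discovered URLs
    store = {}             # page storage stage (in memory, hypothetical)
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = normalize(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# Hypothetical three-page site used instead of real network access.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": "<p>no links</p>",
}

pages = crawl("http://example.test/", FAKE_SITE.get)
```

Here `pages` ends up holding each of the three fake pages exactly once, because `/b#top` and `/b` normalize to the same URL. In a distributed setting the queue and the `seen` set would have to be shared across workers, which is the part a platform such as Storm and an external store would take over.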
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1389504.html
