Research and Design of a Distributed Crawler Based on Docker Clusters

Published: 2018-06-18 10:21

Topic: Docker + distributed crawler; Source: Zhejiang Sci-Tech University, 2017 master's thesis

[Abstract]: Since the government proposed a national big data strategy, Internet big data has become an increasingly important strategic resource, and web crawlers, the main tool for harvesting it, have grown in importance as well. Traditional crawlers, however, are built on VM clusters, which leaves host resources underused and makes the crawler system hard to scale. The emerging virtualization technology Docker offers an opportunity to solve these problems for web crawlers that run in VM environments.

Research on a distributed crawler based on a Docker cluster covers two areas: distributed crawler technology and Docker cluster technology. Open-source crawler frameworks differ in how well they support distribution; the Scrapy framework, for example, does not support distributed operation. Existing frameworks are also better suited to VM clusters and therefore inherit the inefficient use of system resources that VM clusters bring. A Docker cluster is a new virtualized cluster technology that uses host resources more sensibly and efficiently than a VM cluster. Drawing on a study of open-source web crawler architectures, this thesis designs and implements a web crawler system that fully supports distribution and runs it on a Docker cluster. It also improves the crawler's URL deduplication algorithm by adopting the K-divided Bloom filter (K分型 Bloom filter in the original), which deduplicates more effectively, and adapting it to distributed use.

The main work of this thesis is as follows:

(1) Study in depth how web crawlers work and the design patterns behind their overall architecture. Study in detail the orchestration and management tools for Docker clusters, including their working principles and their management and scheduling mechanisms. Study content deduplication algorithms and apply them to the distributed crawler system.

(2) Study open-source web crawler frameworks to understand why they do not support distribution, then design and implement distributed crawler modules suited to a Docker cluster and combine them into a complete, efficient distributed crawler system. Kubernetes, a Docker cluster orchestration and management tool, is used to deploy and manage each functional module so that the system runs on the Docker cluster.

(3) Deploy the distributed crawler on both a VM cluster and a Docker cluster and compare them experimentally at several levels, to show that on the Docker cluster the system crawls more efficiently, uses host resources more fully, and is easier to scale horizontally.

(4) Study the principle of the classical Bloom filter algorithm and its error probability. Improve the K-divided Bloom filter so that it meets the needs of distributed use, deduplicates more effectively, and has a lower error probability. Experiments show that the improved K-divided Bloom filter deduplicates better.
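As a rough illustration of the URL deduplication idea the abstract refers to, the sketch below implements a classical Bloom filter in Python. It is not the thesis's improved K-divided variant, whose details the abstract does not give. The class name `BloomFilter`, the helper `should_crawl`, the SHA-256 double-hashing scheme, and the chosen sizes are all assumptions made for this example. The bit array here lives in local memory; a distributed crawler would need to keep it in a store shared by all crawler nodes.

```python
import hashlib
import math


class BloomFilter:
    """Classical Bloom filter: an m-bit array probed by k hash functions.

    Hypothetical illustration only; not the thesis's K-divided variant.
    """

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        if m_bits <= 0 or k_hashes <= 0:
            raise ValueError("m_bits and k_hashes must be positive")
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive k bit indexes from two 64-bit slices
        # of a single SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard approximation of the error probability: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def should_crawl(seen: BloomFilter, url: str) -> bool:
    """Return True and record the URL the first time it is offered."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A crawler would call `should_crawl(seen, url)` before enqueueing each extracted link. Because a Bloom filter can report false positives but never false negatives, a small fraction of unseen URLs may be skipped, and `false_positive_rate` estimates how large that fraction is for a given array size, hash count, and number of stored URLs.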
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree]: Master's
[Year conferred]: 2017
[Classification number]: TP391.3


Article ID: 2035147


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2035147.html
