Research and Implementation of a Distributed Web Crawler
Published: 2018-06-15 04:46
Topic: search engine + distributed systems. Source: master's thesis, Southeast University, 2017
【Abstract】: With the rapid development of Internet technology, people rely more and more on Internet information in their work and daily lives, and search engine technology has become correspondingly important. Search engines are now deeply embedded in everyday life, and the web crawler is one of their most critical components. A single-machine crawler can no longer keep up with the scale of today's Internet, which has driven the emergence of distributed crawling techniques. In a distributed system, multiple machines divide the work and cooperate, raising the computation speed over very large data volumes and improving crawling throughput; distributed storage likewise greatly improves the storage performance of the whole system. This thesis describes distributed web crawlers in detail and designs and implements a distributed crawler on the Hadoop platform to address the slow speed and low efficiency of single-machine crawlers. The main work is as follows:
(1) It introduces search engine technology, the working principles and key techniques of distributed web crawlers, and the overall architecture of the crawler system, and analyzes the implementation flow and principles of the key modules, including how each module is realized with MapReduce.
(2) To address the ways in which the existing algorithms of the page-fetching module limit crawl content and crawl speed, it proposes an optimized URL weighting algorithm. Since URL filtering and deduplication after fetching are also critical steps, the URL deduplication strategy is optimized as well, resolving the problems of slow crawling and redundant content and substantially improving crawl speed and accuracy (an illustrative MapReduce deduplication sketch follows this abstract).
(3) It builds a test environment for the distributed system, designs test plans covering functional, performance, and scalability testing, and compares the URL weighting algorithm and the URL deduplication strategy before and after optimization.
In summary, the significance of this thesis lies in the design and implementation of a distributed web crawler system that, to a certain extent, overcomes the low efficiency and poor scalability of single-machine crawlers and improves the speed and quality of information collection and web page crawling.
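The abstract does not reproduce any code. As a concrete illustration of the kind of MapReduce-based URL deduplication it refers to, the following minimal Hadoop job is a sketch under assumptions, not the thesis's actual implementation; the class names UrlDedup, UrlMapper, and UrlReducer are hypothetical. The idea is simply to emit each URL as a map-output key, let the shuffle phase group identical URLs onto one reducer call, and write each distinct URL exactly once.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch only: a generic Hadoop MapReduce job that removes
// duplicate URLs from text input (one URL per line). Not the thesis's code.
public class UrlDedup {

    // Emits each non-empty URL as a key; the shuffle phase groups duplicates.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (!line.isEmpty()) {
                url.set(line);
                context.write(url, NullWritable.get());
            }
        }
    }

    // Writes each distinct URL exactly once, discarding grouped duplicates.
    public static class UrlReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url-dedup");
        job.setJarByClass(UrlDedup.class);
        job.setMapperClass(UrlMapper.class);
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The reducer can also be registered as a combiner (job.setCombinerClass(UrlReducer.class)) to cut shuffle traffic, since deduplication is idempotent. The optimized strategy described in the thesis would layer URL normalization and weight-based filtering on top of such a grouping step; those details are not reproduced here.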
【Degree-granting institution】: Southeast University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.3