Research and Implementation of a Distributed Web Crawler
Published: 2018-06-15 04:46
Topic: search engine + distributed systems. Source: master's thesis, Southeast University, 2017
【Abstract】: With the rapid development of Internet technology, people rely more and more on Internet information in their work and daily lives, and search engine technology has become correspondingly important. Search engines are now deeply embedded in everyday life, and the web crawler is one of their most critical components. A single-machine crawler can no longer keep up with the scale of today's Internet, which has driven the emergence of distributed crawling techniques. In a distributed system, multiple machines divide the work and cooperate, raising the computation speed over very large data volumes and improving crawling throughput; distributed storage likewise greatly improves the storage performance of the whole system. This thesis describes distributed web crawlers in detail and designs and implements a distributed crawler on the Hadoop platform to address the slow speed and low efficiency of single-machine crawlers. The main work is as follows:
(1) It introduces search engine technology, the working principles and key techniques of distributed web crawlers, and the overall architecture of the crawler system, and analyzes the implementation flow and principles of the key modules, including how each module is realized with MapReduce.
(2) To address the ways in which the existing algorithms of the page-fetching module limit crawl content and crawl speed, it proposes an optimized URL weighting algorithm. Since URL filtering and deduplication after fetching are also critical steps, the URL deduplication strategy is optimized as well, resolving the problems of slow crawling and redundant content and substantially improving crawl speed and accuracy (an illustrative MapReduce deduplication sketch follows this abstract).
(3) It builds a test environment for the distributed system, designs test plans covering functional, performance, and scalability testing, and compares the URL weighting algorithm and the URL deduplication strategy before and after optimization.
In summary, the significance of this thesis lies in the design and implementation of a distributed web crawler system that, to a certain extent, overcomes the low efficiency and poor scalability of single-machine crawlers and improves the speed and quality of information collection and web page crawling.
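The abstract does not reproduce any code. As a concrete illustration of the kind of MapReduce-based URL deduplication it refers to, the following minimal Hadoop job is a sketch under assumptions, not the thesis's actual implementation; the class names UrlDedup, UrlMapper, and UrlReducer are hypothetical. The idea is simply to emit each URL as a map-output key, let the shuffle phase group identical URLs onto one reducer call, and write each distinct URL exactly once.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch only: a generic Hadoop MapReduce job that removes
// duplicate URLs from text input (one URL per line). Not the thesis's code.
public class UrlDedup {

    // Emits each non-empty URL as a key; the shuffle phase groups duplicates.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (!line.isEmpty()) {
                url.set(line);
                context.write(url, NullWritable.get());
            }
        }
    }

    // Writes each distinct URL exactly once, discarding grouped duplicates.
    public static class UrlReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url-dedup");
        job.setJarByClass(UrlDedup.class);
        job.setMapperClass(UrlMapper.class);
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The reducer can also be registered as a combiner (job.setCombinerClass(UrlReducer.class)) to cut shuffle traffic, since deduplication is idempotent. The optimized strategy described in the thesis would layer URL normalization and weight-based filtering on top of such a grouping step; those details are not reproduced here.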
【Degree-granting institution】: Southeast University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.3