
Research and Implementation of a Distributed Web Crawler

Published: 2018-06-15 04:46

  Topics: search engine + distributed. Source: Southeast University, 2017 master's thesis


[Abstract]: With the rapid development of Internet technology, demand for Internet information in work and daily life keeps growing, and search engine technology is becoming increasingly important. Internet information is widely used in many areas, search engines have become part of everyday life and exert a growing influence on it, and the web crawler is a very important component of a search engine. The crawling capacity of single-machine web crawlers can no longer meet current demands, which has led to the emergence of distributed web crawler technology. In a distributed system, multiple machines divide the work and cooperate effectively, which increases computation speed on very large data volumes and improves crawling performance. Distributed storage likewise greatly improves the data storage performance of the whole system. This thesis describes distributed web crawlers in detail, and designs and implements a distributed web crawler based on the Hadoop platform to address the slow speed and low efficiency of single-machine crawlers. The main work is as follows: (1) It introduces search engine technology, the working principles and key technologies of distributed web crawlers, and the overall architecture of the distributed crawler system, and analyzes the implementation flow and principles of the key modules and the MapReduce implementation of each module. (2) To address the way existing algorithms in the page-fetching module affect what is crawled and how fast, it proposes an optimization of the URL weight algorithm. Because filtering and deduplicating URLs after fetching is also a crucial step, the URL deduplication strategy is optimized as well, which addresses slow crawling and redundant crawled content and greatly improves crawl speed and accuracy. (3) It sets up a test environment for the distributed system, designs a test plan covering functional, performance, and scalability testing, and compares the URL weight algorithm and the URL deduplication strategy before and after optimization. In summary, the significance of this thesis is that it designs and implements a distributed web crawler system that, to some extent, overcomes the low efficiency and poor scalability of single-machine crawlers and improves the speed and quality of information collection and page fetching.
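The abstract does not spell out the optimized URL deduplication strategy, so the following is only a minimal sketch of the general idea in a MapReduce-like shape, not the thesis's method. The function names (`normalize`, `map_phase`, `reduce_phase`, `dedupe`) and the normalization rules are assumptions made for illustration, and the phases run in-process rather than on Hadoop.

```python
from itertools import groupby
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalise a URL so trivially different spellings compare equal.

    Hypothetical rules: lowercase scheme and host, drop default ports,
    drop the fragment, and use "/" for an empty path.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Keep the port only when it differs from the scheme's default.
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, parts.query, ""))


def map_phase(urls):
    """Map step: emit (normalised URL, original URL) pairs."""
    for url in urls:
        yield normalize(url), url


def reduce_phase(pairs):
    """Reduce step: sort by key (the shuffle) and keep one record per key."""
    for key, _group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key


def dedupe(urls):
    """Run both phases locally and return the unique normalised URLs."""
    return list(reduce_phase(map_phase(urls)))
```

Keying the reduce step on the normalized URL means all spellings of one page reach the same reducer, which is why deduplication fits the MapReduce model that the thesis uses for its modules.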
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3


Article ID: 2020734



Link to this article: http://sikaile.net/shoufeilunwen/xixikjs/2020734.html

