
Research on Web Crawler Technology Based on the Hadoop Platform

Published: 2018-05-30 10:32

Topic: web crawler + Hadoop; Source: Nanjing University of Posts and Telecommunications, master's thesis, 2017

[Abstract]: The rapid development of the Internet has brought explosive growth in online content, and the sheer volume of information makes it difficult for users to find what they need. Given the scale of the retrieval task and users' demand for personalized search, improving the efficiency and accuracy of web information search has become a key problem that urgently needs solving, and web crawler technology is an important part of web search technology. Because a single computer can hardly handle a task of this size, this thesis uses the Hadoop cloud platform for distributed computing and storage and runs an improved web crawler on it to fetch information efficiently and accurately. Based on an in-depth study of the Hadoop cloud platform and web crawler technology, the thesis identifies shortcomings in existing topic crawling algorithms and improves on them, proposing a topic crawling algorithm that optimizes feature-word extraction, improves relevance calculation using a semantic tree, and optimizes link ranking using weights. The algorithm is run as MapReduce jobs on the cloud platform to raise the efficiency and accuracy of topic crawling. For link deduplication, the thesis proposes an improved algorithm based on the Bloom filter: building on an optimized Bloom filter storage structure, links are layered by attribute to form a hierarchical Bloom filter tree that deduplicates links quickly and accurately. The algorithm is processed on the cloud platform, which improves its performance and its time and space efficiency and yields a more effective and more accurate link deduplication algorithm.
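The abstract does not describe the storage structure or the attributes used for layering, so the following is only a minimal Python sketch of the general idea of attribute-layered Bloom filters for link deduplication. It assumes, purely for illustration, that links are layered by host name and that each layer is a plain Bloom filter using double hashing over a SHA-256 digest. The names `BloomFilter` and `LayeredBloomFilter` and the chosen sizes are hypothetical and do not come from the thesis.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Plain Bloom filter backed by a bytearray used as a bit array."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class LayeredBloomFilter:
    """One Bloom filter per link attribute (assumed here: the host name)."""

    def __init__(self):
        self.layers = {}

    def seen_before(self, url):
        """Return True if url was probably seen; otherwise record it and return False."""
        host = urlsplit(url).netloc.lower()
        layer = self.layers.setdefault(host, BloomFilter())
        if url in layer:
            return True
        layer.add(url)
        return False


dedup = LayeredBloomFilter()
urls = [
    "http://example.com/a",
    "http://example.org/a",
    "http://example.com/a",  # exact duplicate of the first link
]
fresh = [u for u in urls if not dedup.seen_before(u)]
print(fresh)  # ['http://example.com/a', 'http://example.org/a']
```

A Bloom filter can report false positives but never false negatives, so a crawler built this way never fetches a recorded link twice but may occasionally skip an unseen one. Splitting the filters by an attribute such as host keeps each filter small, which is one plausible reading of the layering the abstract mentions.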
Building on a study of how a Hadoop-based web crawler system works, the thesis constructs such a system and describes the detailed design and implementation of its web page download module, web page document parsing module, and link processing module. The proposed improved algorithms are applied in the implementation of these key modules. The improved algorithms are then verified experimentally on the constructed system, and the results show that they are feasible and effective in improving algorithm performance and efficiency.
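As a rough illustration of how the parsing and link processing modules could map onto the MapReduce model, the sketch below simulates a map phase, a shuffle, and a reduce phase in plain Python, without Hadoop. The regex-based link extraction, the choice of host name as the shuffle key, and the function names `map_page`, `reduce_host`, and `run_job` are all assumptions made for this example. The abstract does not say how the thesis partitions its jobs.

```python
import re
from collections import defaultdict
from urllib.parse import urljoin, urlsplit

# Simplified link extraction; a real parser module would use an HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def map_page(page_url, html):
    """Map step: emit (host, absolute_link) for every href found in one page."""
    for href in HREF_RE.findall(html):
        link = urljoin(page_url, href)
        yield urlsplit(link).netloc, link


def reduce_host(host, links):
    """Reduce step: deduplicate the links that share one host."""
    return host, sorted(set(links))


def run_job(pages):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    grouped = defaultdict(list)
    for page_url, html in pages.items():
        for host, link in map_page(page_url, html):
            grouped[host].append(link)
    return dict(reduce_host(host, links) for host, links in grouped.items())


pages = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://example.org/x">X</a>',
    "http://example.com/b": '<a href="/a">A again</a>',
}
result = run_job(pages)
print(result)
# {'example.com': ['http://example.com/a'], 'example.org': ['http://example.org/x']}
```

Here the reducer deduplicates with an exact in-memory set. At crawl scale that set could be replaced by a Bloom filter, trading a small false-positive rate for much lower memory use.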
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092; TP391.1


Article ID: 1954977


Article link: http://sikaile.net/guanlilunwen/ydhl/1954977.html
