Research on Web Crawler Technology Based on the Hadoop Platform
Topic: web crawler + Hadoop; Source: Nanjing University of Posts and Telecommunications, master's thesis, 2017
[Abstract]: The rapid growth of the Internet has brought an explosion of online content, and the sheer volume of that information makes it hard for users to find what they need. Given the scale of retrieval involved and users' demand for personalized results, improving the efficiency and accuracy of web information search has become a pressing problem, and web crawler technology is an important component of web search. Because a single computer can hardly handle a task of this size, this thesis uses the Hadoop cloud platform for distributed computing and storage and runs an improved web crawler on it to fetch information efficiently and accurately.
Based on an in-depth study of the Hadoop cloud platform and web crawler technology, the thesis identifies shortcomings in existing topic crawling algorithms and improves them. It proposes a topic crawling algorithm that optimizes feature word extraction, improves relevance computation using a semantic tree, and optimizes link ranking using weights. The algorithm is processed with MapReduce on the cloud platform to raise its efficiency and accuracy. For link deduplication, the thesis proposes an improved algorithm based on the Bloom filter: building on an optimized storage structure for the Bloom filter, links are layered by attribute to form a hierarchical Bloom filter tree that deduplicates links quickly and accurately. This algorithm is also processed on the cloud platform, which improves its performance and its time and space efficiency and yields a more effective and more accurate link deduplication algorithm.
Building on a study of how Hadoop-based web crawler systems work, the thesis constructs such a system and presents the detailed design and implementation of its web page download module, web document parsing module, and link processing module, applying the proposed algorithms in the key functional modules. Experiments on this system verify the proposed algorithms, and the results show that they are feasible and effective in improving algorithm performance and efficiency.
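The Bloom-filter-based link deduplication idea can be illustrated with a minimal Python sketch. The abstract does not say which link attributes are used for layering or how the filter tree is organized, so everything below is an assumption made for illustration: the class names `BloomFilter` and `LayeredUrlDeduplicator`, the filter size, the number of hash functions, and the choice of the URL host as the layering attribute are all hypothetical and are not taken from the thesis.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Plain Bloom filter over a fixed-size bit array (sizes are illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class LayeredUrlDeduplicator:
    """Hypothetical two-level layout: one Bloom filter per URL host.

    The host is an assumed layering attribute, not the one used in the thesis.
    """

    def __init__(self):
        self.filters = {}

    def seen_before(self, url):
        """Return True if url was probably seen; otherwise record it and return False."""
        host = urlsplit(url).netloc.lower()
        bloom = self.filters.setdefault(host, BloomFilter())
        if url in bloom:
            return True
        bloom.add(url)
        return False
```

A crawler would call `seen_before(url)` on each extracted link and enqueue only those for which it returns `False`. A Bloom filter can report a new URL as already seen (a false positive) but never the reverse, so the trade-off is a small number of skipped pages in exchange for compact memory use.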
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092; TP391.1
Article ID: 1954977
Article link: http://sikaile.net/guanlilunwen/ydhl/1954977.html