
Research and Optimization of a Distributed Search Engine Based on Nutch

Published: 2018-07-10 20:24

Topic: Nutch + indexing; Source: master's thesis, Wuhan University of Technology, 2013


[Abstract]: Cloud computing has become one of the focal points of the computer industry and academia, and Hadoop, currently the most popular cloud computing platform, is seeing increasingly wide use. At the same time, the open-source search engine package Nutch not only provides the tools a search engine needs but is also highly extensible. A growing number of researchers are studying the combination of Hadoop and Nutch in an effort to improve the performance of distributed search. Building on their results, this thesis studies and optimizes a distributed search engine based on Nutch and Hadoop.

The research covers three areas: the Nutch framework, the Hadoop distributed platform, and the principles of distributed crawlers. First, the thesis analyzes the Nutch framework and the Hadoop distributed platform, examining their architecture and main working principles, including core technologies such as the indexing mechanism, the plug-in mechanism, HDFS, and Map/Reduce. It then focuses on crawler technology, distributed crawling in particular. After analyzing the existing Nutch-based crawling mechanism, it changes the underlying data structure and introduces an extensible hash function into the task-assignment algorithm, which addresses the load-balancing and efficiency problems of the original algorithm.

On this basis, the thesis designs a distributed search system based on Nutch and Hadoop. The index module of the system uses the extensible hash function, and the index and search modules take advantage of Nutch's extensibility to integrate ICTCLAS, the Chinese lexical analysis system developed by the Chinese Academy of Sciences, which effectively improves Nutch's support for Chinese.

Finally, the system is evaluated on a cluster built in the laboratory through functional tests, performance tests, and a comprehensive assessment from several angles. The results verify the feasibility and scalability of the design as well as its performance improvement.
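The abstract does not say which extensible hash function the thesis uses for crawl-task assignment, so the following is only an illustrative sketch of the general idea, using consistent hashing as a stand-in technique. The class name `HashRing`, the node names, and the choice of hashing by host name are all assumptions made for this example and are not taken from the thesis or from Nutch.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring that assigns URLs to crawler nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, to even out the load
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # keep the existing owner on the (very unlikely) collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        """Drop every virtual point that belongs to the node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, url: str) -> str:
        """Return the node responsible for the URL, keyed by its host name."""
        if not self._points:
            raise ValueError("the ring has no nodes")
        host = urlsplit(url).hostname or url
        # The first point clockwise from the host's hash owns the URL.
        index = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[index]]


if __name__ == "__main__":
    ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
    print(ring.node_for("http://example.com/index.html"))
```

Hashing on the host name keeps all pages of one site on the same node, which makes per-site politeness limits easier to enforce. When a node joins, only the URLs that land on its new ring points change owner, so the rest of the assignment is left undisturbed.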
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3



