Research and Application of a Hadoop-Based Search Engine
Published: 2018-04-27 00:17
Topics: search engine + Hadoop. Source: Zhejiang Sci-Tech University, 2013 master's thesis
[Abstract]: With the large-scale spread of network information technology, users' requirements for information retrieval are increasingly strict. Fast, accurate, and comprehensive information search helps institutions of all kinds achieve high customer satisfaction and good commercial returns. Constrained by limited technical and financial resources, most small and medium-sized institutions find it difficult to build a dedicated, efficient search system around user needs the way large institutions do, or to customize such a system further to their own needs. How to make effective use of the technology of the existing search engine giants to provide powerful search computing technology and diversified services to more institutions, especially small and medium-sized enterprises, universities, and research institutions that hold sizable data sets but have limited budgets and weak core development capability, has therefore become a focus and a difficulty of current search research.
Drawing on practical application requirements, this thesis studies the principles, related technologies, and algorithms of a Hadoop-based distributed search engine. It analyzes in depth the distributed computing framework MapReduce and the distributed file system HDFS, presents a concrete design based on the MapReduce programming model, integrates the BM25 ranking model into Lucene for retrieval scoring, and uses the Paoding analyzer for Chinese word segmentation. It completes the architecture design of the system on the Hadoop platform, defines the division of system functions, analyzes and designs the crawling, indexing, and retrieval workflows, and improves and implements the three subsystems.
Building on an analysis, evaluation, and summary of the needs and existing shortcomings of small and medium-sized institutions in achieving efficient information search, this thesis integrates the design and implementation of three relatively independent subsystems, sets up and configures the Hadoop framework, and deploys a three-node distributed search engine system. Finally, starting from the search needs of users at small and medium-sized institutions, the performance of the system is tested and evaluated. Using the Zhejiang Sci-Tech University website as the test subject, the efficiency of web page crawling and indexing is compared between the three-node distributed platform and a single-machine environment. The measured crawling and indexing times show that, for 20,000 web pages, the cluster takes about 15.64% less time than the single machine, and this gap widens as the number of pages grows. In addition, by comparing how well the retrieval results match at different page counts, the average precision of retrieval in the Hadoop-based distributed search engine system is calculated to be nearly 20% higher than in the single-machine environment. The experimental results show that, once an institution's page count grows beyond a certain level, this distributed search engine system for small and medium-sized institutions obtains the more precise results users need faster than a traditional centralized search engine, with improved system security, stability, and scalability, thereby improving the information retrieval effectiveness of small and medium-sized institutions and accelerating their informatization.
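The abstract names two building blocks without showing them: index construction in the MapReduce style and BM25 scoring at retrieval time. The following is a minimal single-process sketch of both ideas, not the thesis's implementation. It does not use the Hadoop or Lucene APIs; the toy corpus, the function names (`map_phase`, `reduce_phase`, `bm25_scores`), whitespace tokenization (standing in for Paoding segmentation), and the parameter values `k1=1.2` and `b=0.75` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (assumption); the thesis crawled a real university site.
DOCS = {
    "d1": "hadoop distributed search engine",
    "d2": "lucene search index",
    "d3": "hadoop mapreduce index build",
}


def map_phase(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = Counter(text.split())  # whitespace split stands in for a real analyzer
    return [(term, (doc_id, tf)) for term, tf in counts.items()]


def reduce_phase(pairs):
    """Reduce step: group postings by term to form an inverted index."""
    index = defaultdict(dict)
    for term, (doc_id, tf) in pairs:
        index[term][doc_id] = tf
    return index


def bm25_scores(query, index, doc_lengths, k1=1.2, b=0.75):
    """Rank documents containing at least one query term with Okapi BM25."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed IDF; the "1 +" inside the log keeps the value non-negative.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# In a real cluster the map calls would run in parallel across nodes.
pairs = [pair for doc_id, text in DOCS.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
doc_lengths = {doc_id: len(text.split()) for doc_id, text in DOCS.items()}
ranking = bm25_scores("hadoop index", index, doc_lengths)
```

For the query `"hadoop index"`, `d3` ranks first because it is the only document containing both terms; `d2` outranks `d1` because each matches one term with the same document frequency, and BM25's length normalization favors the shorter document.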
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
Article ID: 1808328
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1808328.html