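Research and Application of a Full-Text Search Engine Based on Docker Technology

Published: 2018-05-29 05:25

Topic: Hadoop + Map/Reduce; Reference: Nanjing University of Posts and Telecommunications, 2017 master's thesis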

[Abstract]: The third wave of revolution in the computing world is under way. In this wave, the widespread use of cloud computing and big data has pushed data processing to the TB and even PB scale, while also demanding faster and more efficient processing of that data. As a result, the big data processing methods and technologies derived from the concept of cloud computing have become the mainstream of this wave [20]. Hadoop is the most widely used big data processing platform in this wave, and a full-text search engine with a Hadoop architecture built on virtualization technology offers stable operation, low cost, and ease of management, storage, and computation. To build the full-text search engine, this thesis first analyzes and summarizes the strengths and weaknesses of several current distributed search engines and proposes a distributed search engine based on the Hadoop platform. It then analyzes the limitations of traditional server deployment and compares the processing performance of traditional virtualization technology with that of Docker container technology; on that basis, Docker containers are used as the underlying infrastructure of the Hadoop platform in order to optimize its performance. Next, the crawling, indexing, and query subsystems of the distributed search engine are studied, and the Map/Reduce parallel programming model is applied so that the Map function encapsulates the data computation tasks and the Reduce function encapsulates the data merging tasks. For full-text retrieval, the system uses an inverted-file (inverted index) technique combined with TF-IDF (term frequency-inverse document frequency) and the PageRank algorithm for relevance scoring, which optimizes the retrieval method. In addition, the underlying Docker containers make the search engine easier to deploy and migrate.

Based on this research, the thesis first verifies through comparative experiments that Docker has an advantage in read/write performance over traditional virtualization technology. It then designs and optimizes a deployment scheme for Hadoop on a Docker container cluster. Building on these two points, a full-text search engine system with a Hadoop architecture based on Docker technology is designed and built, and its performance, reliability, and scalability are tested. Analysis of the experimental data verifies the soundness and correctness of the Docker-based Hadoop full-text search engine.
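
The abstract's division of labor between Map (computation) and Reduce (merging), together with inverted-index retrieval scored by TF-IDF, can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`, `tf_idf`, `search`), the whitespace tokenizer, and the three sample documents are all assumptions made for illustration, the Hadoop framework is simulated with plain Python, and PageRank is left out.

```python
import math
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, (doc_id, 1)) pair per token in a document."""
    for token in text.lower().split():  # naive whitespace tokenizer (assumption)
        yield token, (doc_id, 1)


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, values):
    """Reduce: merge per-token counts into one postings dict for the term."""
    postings = defaultdict(int)
    for doc_id, count in values:
        postings[doc_id] += count
    return term, dict(postings)


def build_index(docs):
    """Run map, shuffle, and reduce to build term -> {doc_id: term count}."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(t, v) for t, v in shuffle(pairs).items())


def tf_idf(index, doc_lengths, term, doc_id):
    """TF-IDF with tf = count / document length and idf = ln(N / df)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id] / doc_lengths[doc_id]
    idf = math.log(len(doc_lengths) / len(postings))
    return tf * idf


def search(index, doc_lengths, query):
    """Sum TF-IDF over the query terms and rank documents by descending score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id in index.get(term, {}):
            scores[doc_id] += tf_idf(index, doc_lengths, term, doc_id)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample corpus, not taken from the thesis.
docs = {
    "d1": "docker runs hadoop in containers",
    "d2": "hadoop stores data in hdfs",
    "d3": "docker containers start quickly",
}
index = build_index(docs)
doc_lengths = {doc_id: len(text.split()) for doc_id, text in docs.items()}
results = search(index, doc_lengths, "hdfs data")
```

Here `results` ranks only `d2`, since it is the one document containing either query term. In a real Hadoop job the map and reduce functions would run on separate nodes and the shuffle would be handled by the framework; a production ranking would also blend in a link-based score such as PageRank, as the abstract describes.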
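[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3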

Article ID: 1949662

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1949662.html
