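Research and Application of a Full-Text Search Engine Based on Docker Technology
Topic: Hadoop + Map/Reduce; Source: Nanjing University of Posts and Telecommunications, 2017 master's thesis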
[Abstract]: With the rise of the third wave of revolution in computing, the widespread use of cloud computing and big data has pushed data processing to the TB and even PB scale, while demanding that this data be processed faster and more efficiently. As a result, the big data processing methods and technologies derived from the cloud computing concept have become the mainstream of this wave [20]. Hadoop is the most widely used big data processing platform in this wave, and a full-text search engine built on a virtualization-based Hadoop architecture offers stable operation, low cost, and ease of management, storage, and computation. To build such a search engine, this thesis first analyzes and summarizes the advantages and disadvantages of several current distributed search engines and proposes a distributed search engine based on the Hadoop platform. It then analyzes the limitations of traditional server deployment and compares the processing performance of traditional virtualization technology with that of Docker container technology; on that basis, Docker containers are used as the underlying infrastructure of the Hadoop platform in order to optimize its performance. Next, the three subsystems of the distributed search engine (crawling, indexing, and querying) are studied, and the parallel computing approach of Map/Reduce is applied so that the Map function encapsulates the data computation tasks and the Reduce function encapsulates the data merging tasks. For full-text retrieval, the system uses an inverted index combined with TF-IDF (term frequency-inverse document frequency) and the PageRank algorithm to compute relevance and optimize retrieval. The underlying Docker containers also make the search engine easier to deploy and migrate.
Based on this research, the thesis first verifies through comparative experiments that Docker has better read/write performance than traditional virtualization technology. It then designs and optimizes a deployment scheme for Hadoop on a Docker container cluster. Building on these two results, a full-text search engine system with a Hadoop architecture based on Docker technology is designed and built, and its performance, reliability, and scalability are tested. Analysis of the experimental data verifies the soundness and correctness of the Docker-based Hadoop full-text search engine.
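The division of labor described in the abstract, where Map encapsulates the computation and Reduce encapsulates the merging, can be illustrated with a small single-process sketch that builds an inverted index and ranks documents by TF-IDF. This is not the thesis's implementation: the function names, the whitespace tokenizer, the sample documents, and the particular TF-IDF variant (term count divided by document length, times log(N/df)) are all assumptions made for illustration, and the PageRank component is omitted.

```python
import math
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, (doc_id, 1)) pair per token occurrence."""
    for term in text.lower().split():
        yield term, (doc_id, 1)

def reduce_phase(term, values):
    """Reduce: merge all pairs for one term into {doc_id: term count}."""
    postings = defaultdict(int)
    for doc_id, count in values:
        postings[doc_id] += count
    return term, dict(postings)

def build_inverted_index(docs):
    """Run map, shuffle (group by term), and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_phase(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_phase(term, values) for term, values in grouped.items())

def tf_idf(index, docs, term, doc_id):
    """Assumed variant: (count / document length) * log(N / document frequency)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id] / len(docs[doc_id].split())
    idf = math.log(len(docs) / len(postings))
    return tf * idf

def search(index, docs, query):
    """Rank documents by the sum of TF-IDF scores over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id in index.get(term, {}):
            scores[doc_id] += tf_idf(index, docs, term, doc_id)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample corpus, used only to exercise the sketch.
docs = {
    "d1": "docker runs hadoop in containers",
    "d2": "hadoop stores data in hdfs",
    "d3": "search engine builds an inverted index",
}
index = build_inverted_index(docs)
results = search(index, docs, "docker hadoop")
```

On a real Hadoop cluster the grouping step is performed by the framework's shuffle between the Map and Reduce tasks; here it is simulated with an in-memory dictionary so the sketch stays self-contained.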
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3
Article ID: 1949662
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1949662.html