Research on Small File Processing and Replica Strategy Optimization Based on HDFS
Published: 2018-04-21 11:35
Topic: HDFS + small file processing; Reference: Ocean University of China, master's thesis, 2014
[Abstract]: As an open-source implementation of GFS, the Hadoop Distributed File System (HDFS) performs well on large files but is inefficient with small ones. The main reason is that a massive number of small files consumes a great deal of NameNode memory, so the single NameNode easily becomes the performance bottleneck of the whole cluster. In addition, HDFS uses a static three-replica strategy and determines replica locations in a rack-aware manner. Although this strategy provides a degree of fault tolerance and load balancing, its drawbacks are obvious: it is too rigid, it wastes considerable storage, and its load balancing is far from ideal.

To address the weakness of HDFS with small files, this thesis proposes an index-based optimization scheme for small file processing. The core idea is to let DataNodes take over part of the NameNode's role, which spreads the pressure of small file processing and relieves the single-NameNode bottleneck under large numbers of requests. A caching strategy is also introduced to further improve file read efficiency. Furthermore, to achieve balanced storage, the thesis proposes a composite quantitative metric for DataNodes, builds a dynamic replica strategy on top of it, and implements a dynamic replica placement algorithm. The main contributions are as follows:

1. To address the low efficiency of HDFS on small files, the thesis proposes a more general index-based optimization scheme that processes small files in a distributed manner, reduces the NameNode bottleneck, and improves small file processing efficiency.
2. On top of the index scheme, the thesis introduces a caching strategy into the file read path, realizing distributed, independent caches that optimize HDFS I/O and speed up file reads.
3. To address the low storage efficiency and uneven storage distribution caused by the original static three-replica strategy, the thesis proposes a new dynamic replica strategy that quantifies DataNode performance with multiple indicators and implements a dynamic replica placement algorithm, improving cluster balance and storage efficiency.

Experimental results on a test cluster show that both the index-based small file optimization scheme and the dynamic replica strategy considerably improve performance over the original HDFS, and that they also have a clear advantage over existing optimization schemes.
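The abstract does not describe the index layout or the cache policy, so the following is only a minimal sketch of the general idea behind an index plus a read cache. It assumes small files are packed into one large container, located through a name-to-(offset, length) index, and cached with an LRU policy. The class `SmallFilePack`, its fields, and the LRU choice are all hypothetical, and an in-memory byte array stands in for a real HDFS file.

```python
from collections import OrderedDict


class SmallFilePack:
    """Hypothetical container: many small files packed into one blob.

    The blob stands in for a single large HDFS file; the index maps each
    small file's name to its (offset, length) inside that blob.
    """

    def __init__(self, cache_size=2):
        self.blob = bytearray()
        self.index = {}              # name -> (offset, length)
        self.cache = OrderedDict()   # LRU cache: name -> bytes
        self.cache_size = cache_size
        self.blob_reads = 0          # reads that had to touch the blob

    def put(self, name, data):
        """Append a small file to the blob and record where it landed."""
        self.index[name] = (len(self.blob), len(data))
        self.blob.extend(data)
        self.cache.pop(name, None)   # drop any stale cached copy

    def get(self, name):
        """Return a file's bytes, serving repeated reads from the cache."""
        if name in self.cache:
            self.cache.move_to_end(name)     # mark as most recently used
            return self.cache[name]
        offset, length = self.index[name]    # KeyError if name is unknown
        self.blob_reads += 1
        data = bytes(self.blob[offset:offset + length])
        self.cache[name] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used
        return data


pack = SmallFilePack(cache_size=2)
pack.put("a.txt", b"alpha")
pack.put("b.txt", b"beta")
print(pack.get("b.txt"))
```

In this sketch only the container would need a NameNode entry, while per-file lookups go through the index. Where the thesis actually stores its index and how it keeps caches consistent across DataNodes is not stated in the abstract.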
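The dynamic replica placement idea can be illustrated in a similarly hedged way. The abstract says only that several indicators are combined to quantify each DataNode; it names neither the indicators nor the weights. The sketch below therefore assumes three indicators (free disk, idle CPU, unused bandwidth) with made-up weights, and it keeps HDFS-style rack awareness by preferring racks that do not yet hold a replica. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str
    rack: str
    free_disk: float   # fraction of disk free, 0..1
    idle_cpu: float    # fraction of CPU idle, 0..1
    free_net: float    # fraction of bandwidth unused, 0..1


# Hypothetical weights; the abstract does not state the real ones.
WEIGHTS = {"free_disk": 0.5, "idle_cpu": 0.3, "free_net": 0.2}


def score(node):
    """Weighted sum of the node's indicators; higher means less loaded."""
    return sum(w * getattr(node, metric) for metric, w in WEIGHTS.items())


def place_replicas(nodes, replicas):
    """Pick `replicas` node names by descending score, spreading over racks."""
    ranked = sorted(nodes, key=score, reverse=True)
    chosen, used_racks = [], set()
    # First pass: take the best node of each rack not used yet.
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node.rack not in used_racks:
            chosen.append(node)
            used_racks.add(node.rack)
    # Second pass: fill any remaining slots with the best leftovers.
    for node in ranked:
        if len(chosen) == replicas:
            break
        if node not in chosen:
            chosen.append(node)
    return [n.name for n in chosen]


cluster = [
    DataNode("dn1", "r1", 0.9, 0.8, 0.7),
    DataNode("dn2", "r1", 0.8, 0.9, 0.9),
    DataNode("dn3", "r2", 0.2, 0.5, 0.5),
    DataNode("dn4", "r2", 0.6, 0.4, 0.3),
]
print(place_replicas(cluster, 3))
```

The replica count is passed in as a parameter here. How the thesis decides that count dynamically is not stated in the abstract, so it is left outside the sketch.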
[Degree-granting institution]: Ocean University of China
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.06
Article ID: 1782306
Article link: http://sikaile.net/guanlilunwen/ydhl/1782306.html