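Design and Implementation of an HDFS-Based Storage System for Massive Numbers of Small Files
Published: 2018-06-10 00:19
Topics: massive small-file storage + distributed file system; Source: National University of Defense Technology, master's thesis, 2012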
[Abstract]: In recent years, both enterprise and personal data have grown explosively. Google CEO Eric Schmidt has said that the world now creates as much data every two days as was produced from the dawn of civilization through 2003. Storing data at this scale has become a major challenge for current storage systems. Traditional centralized storage can no longer meet demand, which led to distributed file systems for large-scale data storage, such as Google File System (GFS), Hadoop Distributed File System (HDFS), PVFS, and Lustre.

These distributed file systems scale well and tolerate faults, so they can meet the demands of massive data storage. Many applications, however, need to store massive numbers of small files as well as massive large files. Although GFS, HDFS, and similar systems store large files efficiently, they are very inefficient at storing massive numbers of small files. Industry and academia have proposed many approaches to this problem, but these generally suffer from low performance, limited system reliability, and inefficient storage of small-file metadata. To address these challenges, this thesis designs and implements an HDFS-based storage system for massive numbers of small files.

The main design idea is to pack the small files within a folder, under the existing HDFS directory tree structure, into one large file for storage, called the small-file data file. A small-file index is generated at the same time, recording the position of each small file in its corresponding data file.

The system designed and implemented here is a scalable, highly fault-tolerant, distributed cluster system for massive small-file storage. The proposed small-file aggregation storage technique stores small-file data in HDFS data files, which provides distributed storage and fault tolerance for the data. The proposed distributed small-file index management technique spreads the index across the data nodes for management, removing the bottleneck that a single metadata node becomes when storing massive numbers of small files. The index fault-tolerance mechanism protects the index and so reduces the risk of losing small files. Creating multiple data files within a single directory resolves conflicts when small files in the same directory are accessed. On this basis, the client caches the index positions of the small files that users access frequently, along with data file stream information, which improves file access efficiency.

Experiments show that the system substantially improves small-file read/write latency and throughput compared with native HDFS without added small-file support. The system also effectively addresses the problem of excessively large metadata in massive small-file storage, and its index fault-tolerance mechanism improves reliability.
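The packing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it uses in-memory byte strings as a stand-in for HDFS files, and the names `IndexEntry`, `pack_small_files`, and `read_small_file` are hypothetical, chosen only for illustration. It shows the small files of one folder being concatenated into a single data file, with an index that records each file's offset and length.

```python
from __future__ import annotations

import io
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Position of one small file inside the packed data file (hypothetical layout)."""
    offset: int
    length: int


def pack_small_files(files: dict[str, bytes]) -> tuple[bytes, dict[str, IndexEntry]]:
    """Concatenate small files into one blob and build a name -> position index."""
    data = io.BytesIO()
    index: dict[str, IndexEntry] = {}
    for name, content in files.items():
        # Record where this file starts before appending its bytes.
        index[name] = IndexEntry(offset=data.tell(), length=len(content))
        data.write(content)
    return data.getvalue(), index


def read_small_file(data_file: bytes, index: dict[str, IndexEntry], name: str) -> bytes:
    """Look up a small file in the index and slice it out of the data file."""
    entry = index[name]
    return data_file[entry.offset:entry.offset + entry.length]


# Assumption: a dict stands in for the small files of one folder.
folder = {"a.txt": b"hello", "b.txt": b"", "c.bin": b"\x00\x01\x02"}
data_file, index = pack_small_files(folder)
restored = read_small_file(data_file, index, "c.bin")
```

In this sketch the file system only has to track one large object per folder, while the per-file bookkeeping moves into the index; how the thesis stores, distributes, and replicates that index is not specified in the abstract and is not modeled here.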
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP333
Article No.: 2001333
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2001333.html