Research and Implementation of an Optimized Small-File Storage Strategy for HDFS
Published: 2018-08-30 13:23
【Abstract】: With the rapid growth of Internet data, storing and processing massive data has become one of the biggest challenges of the big-data era. A wide variety of cloud storage systems have emerged, and companies at home and abroad have invested in researching and developing their own. HDFS, an open-source distributed file system modeled on Google's GFS, is designed to store massive data and offers high reliability, high availability, and high scalability. An HDFS cluster uses a master-slave architecture: a central node stores the file system's metadata, while many data nodes store the actual data. Large files are split into blocks that are distributed across different data nodes. When HDFS is applied to workloads containing large numbers of small files, however, the central node's memory is consumed rapidly, which limits the capacity of the HDFS cluster and subjects the central node to a flood of metadata queries. This thesis examines HDFS's built-in solutions for small-file storage, which merge and compress files on the server side; because they rely on a multi-level index lookup, their read and write performance is poor. To address these shortcomings, a client-side small-file merging strategy is proposed. Small files are buffered and merged on the client into one large file, with each small file's offset information written at the beginning of that large file, which is then stored in a data node as a single block. A small-file mapping table is added on the data node, extending the native Inode structure; the data node uses the small-file index information to extract the contents of individual small files; and a cache prefetching strategy is employed to improve read performance. Finally, a test plan is designed to evaluate the extended system's memory footprint and read/write performance. Compared with the original HDFS small-file storage scheme, the extended system reduces memory usage by up to 70%, shortens average write time by 20%, and, with the prefetching strategy, shortens average read time by 40%.
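To make the merged-block layout concrete, the following is a minimal, self-contained Java sketch of the client-side merging described in the abstract: small files are buffered, an index of (name, offset, length) entries is serialized at the head of the merged block, and a lookup routine extracts one small file through that index. All names here (SmallFileMerger, merge, extract) and the exact header layout are illustrative assumptions, not the thesis's actual implementation, and no HDFS-specific APIs are used.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not the thesis code): buffer small files on the
// client, concatenate them into one merged block, and place an index of
// (name, offset, length) entries at the beginning of the block.
public class SmallFileMerger {

    // Small files buffered on the client, kept in insertion order.
    private final Map<String, byte[]> pending = new LinkedHashMap<>();

    public void add(String name, byte[] content) {
        pending.put(name, content);
    }

    // Serialize the index header followed by the file payloads.
    public byte[] merge() throws IOException {
        // First pass: size the header so payload offsets can be absolute.
        int headerSize = 4; // 4-byte entry count
        for (String name : pending.keySet()) {
            int nameLen = name.getBytes(StandardCharsets.UTF_8).length;
            headerSize += 2 + nameLen + 8 + 4; // name length + name + offset + length
        }

        // Second pass: write one (name, offset, length) entry per small file.
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        DataOutputStream idx = new DataOutputStream(header);
        idx.writeInt(pending.size());
        long offset = headerSize;
        for (Map.Entry<String, byte[]> e : pending.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            idx.writeShort(name.length);
            idx.write(name);
            idx.writeLong(offset);              // absolute offset within the block
            idx.writeInt(e.getValue().length);
            offset += e.getValue().length;
        }

        // Header first, then the concatenated small-file contents.
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        block.write(header.toByteArray());
        for (byte[] content : pending.values()) {
            block.write(content);
        }
        return block.toByteArray();
    }

    // Data-node-side lookup: walk the index header and copy out one file.
    public static byte[] extract(byte[] block, String wanted) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            byte[] name = new byte[in.readUnsignedShort()];
            in.readFully(name);
            long off = in.readLong();
            int len = in.readInt();
            if (wanted.equals(new String(name, StandardCharsets.UTF_8))) {
                // The int cast is safe for typical HDFS block sizes (64/128 MB).
                return Arrays.copyOfRange(block, (int) off, (int) off + len);
            }
        }
        return null; // not found in this merged block
    }
}

Placing the index at the head of the block is consistent with the design described in the abstract: the data node can rebuild its small-file mapping table by reading only the first bytes of the block, without scanning the payloads.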
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2013
【Classification number】: TP333
Article ID: 2213203
Link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2213203.html