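Research on Optimization Methods for Small-File Data Storage Performance in the Hadoop Distributed File System
Published: 2018-02-27 01:16
Keywords: memory consumption; secondary index; small-file merging; hot storage. Source: Beijing Jiaotong University, 2017 master's thesis. Document type: degree thesis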
[Abstract]: Society has entered the era of big data, and efficient data storage and retrieval have become a major concern. Hadoop performs well when storing large data sets, but the spread of social applications such as blogs, Wikipedia, and personal-space services now generates small files in very large numbers, which makes storing them a serious challenge. Because the Hadoop Distributed File System (HDFS) relies on a single Namenode, it stores small files inefficiently, and the Namenode easily becomes a bottleneck. This thesis proposes a new scheme for storing small files on HDFS and tests its feasibility. The work was supported by the National Natural Science Foundation of China (Nos. 61271308, 61172072, 61401015), a graduate discipline construction project of the Beijing Municipal Education Commission, and a project of the Chengdu Survey and Design Research Institute of PowerChina.

The main work is as follows. First, the thesis analyzes the characteristics and problems of HDFS: when a single Namenode stores massive numbers of small files, it must hold a large amount of metadata, so its memory consumption becomes excessive. The thesis therefore merges small files into large files. After merging, however, reading a small file requires a second index lookup to locate it inside the merged file, which costs some read efficiency. To compensate, the thesis introduces secondary-index metadata and adds prefetching and caching mechanisms. Based on this analysis, it proposes an extended HDFS architecture that inserts a data processing layer between the user layer and the data storage layer. This layer handles small-file merging as well as file prefetching and caching, and thereby improves small-file storage performance.

The extended architecture applies three algorithms. The first is a file-type-based small-file merging algorithm, which classifies small files by file extension and merges each class into large files, effectively reducing Namenode memory consumption. The second is a file-type-based secondary-index algorithm for merged-file metadata, which speeds up reading of the mapping file that records which small files went into which merged file and so improves the overall read efficiency of the system. The third is a hot-storage algorithm based on dynamic frequency statistics, which keeps the merged files read most frequently within a given time window in the prefetch-and-cache component; when a user requests a file held there, the corresponding small file is read directly without contacting the Namenode, which further improves read efficiency.

Finally, the thesis builds a pseudo-distributed Hadoop platform and compares the original HDFS storage structure, HAR archive files, and the improved HDFS storage structure on three measures: Namenode memory consumption, file write efficiency, and file read efficiency. The results show that the improved structure reduces write efficiency to some extent but effectively lowers Namenode memory consumption and improves small-file read efficiency, so it delivers better storage performance than the original small-file storage schemes.
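The abstract names the merging, secondary-index, and hot-storage steps but does not give their details, so the following is only a minimal in-memory sketch of how such a data processing layer might fit together. It is not the thesis's implementation: the function and class names (`merge_by_extension`, `read_small_file`, `HotCache`), the index layout, and the use of Python dictionaries in place of real HDFS files are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import os


def _extension(name):
    """Return the lower-cased extension, or a placeholder if there is none."""
    return os.path.splitext(name)[1].lower() or "<none>"


def merge_by_extension(small_files):
    """Group small files by extension and concatenate each group.

    small_files: dict mapping file name -> bytes (stand-in for real files).
    Returns (merged, index):
      merged maps extension -> concatenated bytes (one "large file" per type);
      index is two-level: extension -> {file name -> (offset, length)}.
    """
    buffers = defaultdict(bytearray)
    index = defaultdict(dict)
    for name, data in small_files.items():
        ext = _extension(name)
        # Record where this file starts before appending its bytes.
        index[ext][name] = (len(buffers[ext]), len(data))
        buffers[ext] += data
    merged = {ext: bytes(buf) for ext, buf in buffers.items()}
    return merged, dict(index)


def read_small_file(name, merged, index):
    """Two-step lookup: the extension selects the merged file and its
    per-type index, then the file name selects the byte range."""
    ext = _extension(name)
    offset, length = index[ext][name]
    return merged[ext][offset:offset + length]


class HotCache:
    """Hypothetical hot store: caches the merged files that were read most
    often during the current statistics window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = Counter()   # read counts for the current window
        self.cached = {}        # extension -> merged bytes

    def record(self, ext):
        """Count one read against the merged file for this extension."""
        self.hits[ext] += 1

    def refresh(self, merged):
        """At the end of a window, cache the top entries and reset counts."""
        hot = [ext for ext, _ in self.hits.most_common(self.capacity)]
        self.cached = {ext: merged[ext] for ext in hot}
        self.hits.clear()
```

In this sketch, each file type yields one merged object, so the number of entries a Namenode-like component would track grows with the number of types, not the number of small files. A read served from `HotCache.cached` needs only the per-type index and never touches the store of merged files, which mirrors the abstract's claim that cached reads avoid the Namenode.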
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP333; TP311.13
Article ID: 1540494
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1540494.html