Research on Optimizing the Storage Performance of Small Files in the Hadoop Distributed File System
發(fā)布時(shí)間:2018-02-27 01:16
Keywords: memory consumption; two-level index; small-file merging; hot storage. Source: Beijing Jiaotong University, master's thesis, 2017. Document type: degree thesis.
【Abstract】: Society has entered the era of big data, and efficient data storage and retrieval has become a widely studied problem. Hadoop offers good storage performance for large data sets, but the recent popularity of blogs, wikis, personal spaces and other social applications generates huge numbers of small files, and storing them is a serious challenge. Because the Hadoop Distributed File System (HDFS) is built around a single Namenode, it stores small files inefficiently and the Namenode easily becomes a bottleneck. This thesis proposes a new scheme for storing small files in HDFS and tests its feasibility. The research was supported by National Natural Science Foundation of China projects No. 61271308, 61172072 and 61401015, by a graduate discipline construction project of the Beijing Municipal Education Commission, and by a project of the Chengdu Survey and Design Research Institute of Power Construction Corporation of China. The main work is as follows.

First, the thesis analyzes the characteristics and problems of HDFS: with a single Namenode, storing a massive number of small files produces a large amount of metadata and consumes excessive Namenode memory. The adopted remedy is to merge small files into large files. After merging, however, reading a small file requires a second index lookup to locate it inside the merged file, which reduces read efficiency; read performance is therefore restored by introducing two-level index metadata and adding prefetching and caching mechanisms.

Based on this analysis, the thesis proposes an extended HDFS architecture that inserts a data processing layer between the user layer and the data storage layer. This layer performs small-file merging together with file prefetching and caching, and thereby improves small-file storage performance. Within this architecture the following algorithms are applied: a file-type-based small-file merging algorithm, which classifies small files by file extension and merges each class into large files, effectively reducing Namenode memory consumption; a file-type-based two-level index algorithm over merged-file metadata, which speeds up the mapping from a small file to its merged file and thus raises the system's overall read efficiency; and a hot-storage algorithm based on dynamic access-frequency statistics, which keeps the merged files read most often within a time window in the prefetch-and-cache component, so that requests for those files can be served directly without interacting with the Namenode, further improving read efficiency.

Finally, a pseudo-distributed Hadoop platform is built to compare the original HDFS storage structure, HAR archives, and the improved HDFS storage structure with respect to Namenode memory consumption, file write efficiency and file read efficiency. The experiments show that although the improved structure reduces write efficiency to some extent, it effectively lowers Namenode memory consumption and improves small-file read efficiency, and therefore provides better storage performance than the original small-file storage schemes.
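The abstract gives no implementation details; purely as an illustration, the sketch below shows one way the file-type-based merging and the two-level index could be realized on top of the Hadoop FileSystem API. The class name SmallFileMerger, the in-memory index layout, and the choice to merge one file extension per call are assumptions made here for the example, not the design described in the thesis.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Hypothetical sketch: merge small files of one type into a single HDFS file and
 * keep a (file name -> offset, length) index, so the Namenode stores metadata for
 * one merged file instead of one entry per small file.
 */
public class SmallFileMerger {

    /** Second-level index: small file name -> {offset, length} in the merged file. */
    private final Map<String, long[]> index = new HashMap<>();

    /** Merge every file under srcDir with the given extension into mergedPath. */
    public void merge(FileSystem fs, Configuration conf, Path srcDir,
                      String extension, Path mergedPath) throws IOException {
        try (FSDataOutputStream out = fs.create(mergedPath)) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                // First-level grouping: only files of the requested type are merged here.
                if (!status.isFile() || !status.getPath().getName().endsWith(extension)) {
                    continue;
                }
                long offset = out.getPos();
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, conf, false);
                }
                index.put(status.getPath().getName(),
                          new long[] { offset, status.getLen() });
            }
        }
    }

    /** Read one small file back out of the merged file via the index (assumes the name was merged). */
    public byte[] read(FileSystem fs, Path mergedPath, String smallFileName) throws IOException {
        long[] entry = index.get(smallFileName);
        byte[] data = new byte[(int) entry[1]];
        try (FSDataInputStream in = fs.open(mergedPath)) {
            in.readFully(entry[0], data);   // positioned read at the recorded offset
        }
        return data;
    }
}

In practice the index would have to be persisted (for example alongside the merged file) rather than held in a transient map; the sketch only illustrates the offset/length bookkeeping that makes the second-level lookup possible.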
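In the same illustrative spirit, the next sketch shows the idea behind the hot-storage step: count how often each merged file is read and keep the most frequently read payloads in memory, so repeated requests can be answered without contacting the Namenode. The class name HotCache, the fixed capacity, and the least-frequently-read eviction rule are assumptions for this example only.

import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of frequency-based hot storage: cache the payloads of the
 * merged files that are read most often so repeated reads bypass the Namenode.
 */
public class HotCache {

    private final int capacity;                                    // max number of cached merged files
    private final Map<String, Long> readCounts = new ConcurrentHashMap<>();
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public HotCache(int capacity) {
        this.capacity = capacity;
    }

    /** Record the access and return the cached bytes, or null on a cache miss. */
    public byte[] get(String mergedFileName) {
        readCounts.merge(mergedFileName, 1L, Long::sum);           // dynamic frequency statistics
        return cache.get(mergedFileName);
    }

    /** After a miss has been served from HDFS, admit the file and evict the coldest entry if full. */
    public void put(String mergedFileName, byte[] data) {
        cache.put(mergedFileName, data);
        if (cache.size() > capacity) {
            cache.keySet().stream()
                 .min(Comparator.comparingLong((String k) -> readCounts.getOrDefault(k, 0L)))
                 .ifPresent(cache::remove);
        }
        // A periodic task could decay or reset readCounts so the statistics reflect
        // a recent time window rather than the whole history, as the abstract describes.
    }
}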
【Degree-granting institution】: Beijing Jiaotong University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification codes】: TP333; TP311.13
Article ID: 1540494
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1540494.html