

Research on Optimization Methods for Small-File Storage Performance in the Hadoop Distributed File System

Published: 2018-02-27 01:16

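Keywords: memory consumption; secondary index; small-file merging; hot storage. Source: Beijing Jiaotong University, master's thesis, 2017. Document type: degree thesis.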


[Abstract]: In the era of big data, efficient data storage and retrieval have become a central concern. Hadoop performs well when storing large data sets, but the recent spread of social applications such as blogs, Wikipedia, and personal spaces has produced small files in very large numbers, which makes storing them a serious challenge. Because the Hadoop Distributed File System (HDFS) relies on a single Namenode, it stores small files inefficiently and the Namenode easily becomes a bottleneck. This thesis proposes a new scheme for storing small files in HDFS and tests its feasibility. The work was supported by the National Natural Science Foundation of China (Nos. 61271308, 61172072, 61401015), a graduate discipline construction project of the Beijing Municipal Education Commission, and a project of the PowerChina Chengdu Survey and Design Research Institute.

The thesis first analyzes the characteristics and problems of HDFS: storing a massive number of small files generates a large amount of metadata on the single Namenode, so its memory consumption becomes excessive. The thesis therefore merges small files into large files. After merging, however, reading a small file requires a second index lookup to locate it inside the merged file, which lowers read efficiency to some degree. To compensate, the thesis introduces secondary-index metadata and adds prefetching and caching mechanisms.

Based on this analysis, the thesis proposes an extended HDFS architecture that inserts a data processing layer between the user layer and the data storage layer. This layer performs small-file merging and file prefetching and caching in order to improve small-file storage performance. The extended architecture applies three algorithms. A file-type-based small-file merging algorithm classifies small files by file extension and merges each class into large files, which effectively reduces Namenode memory consumption. A file-type-based secondary-index algorithm for merged-file metadata speeds up reading of the mapping file that relates small files to merged files, and thereby raises the overall read efficiency of the system. A hot-storage algorithm based on dynamic frequency statistics keeps the merged files read most often within a given period in the prefetch-and-cache component; when a user requests a file held there, the corresponding small file is read directly without contacting the Namenode, which also improves read efficiency.

Finally, the thesis builds a pseudo-distributed Hadoop platform and compares the original HDFS storage structure, HAR archive files, and the improved HDFS storage structure in three respects: Namenode memory consumption, file write efficiency, and file read efficiency. The experimental results show that the improved structure reduces write efficiency to some extent, but it effectively lowers Namenode memory consumption and improves small-file read efficiency, and therefore delivers better storage performance than the original small-file storage schemes.
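The abstract does not give the algorithms themselves, so the following is only a minimal in-memory sketch of the three ideas it names: merging by file extension, a two-level index (extension, then file name, then offset and length), and a frequency-based hot cache. Everything here is an assumption made for illustration: the names `merge_by_extension` and `HotCache`, the `capacity` parameter, and the use of Python dictionaries in place of HDFS and the Namenode. It is not the thesis's implementation.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def extension_of(name):
    """Return the lower-cased file extension, or a placeholder if there is none."""
    return PurePosixPath(name).suffix.lower() or "<none>"


def merge_by_extension(small_files):
    """Concatenate small files into one blob per extension.

    Returns (merged, index): merged maps extension -> bytes, and index is a
    two-level mapping extension -> {file name -> (offset, length)}.
    """
    merged = defaultdict(bytearray)
    index = defaultdict(dict)
    for name, data in small_files.items():
        ext = extension_of(name)
        # Record where this file starts inside the merged blob, then append it.
        index[ext][name] = (len(merged[ext]), len(data))
        merged[ext] += data
    return {ext: bytes(blob) for ext, blob in merged.items()}, dict(index)


class HotCache:
    """Hypothetical read path that caches the most frequently read merged files."""

    def __init__(self, merged, index, capacity=1):
        self.merged = merged        # stands in for merged files stored in HDFS
        self.index = index          # two-level index built at merge time
        self.capacity = capacity    # number of merged files kept in the cache
        self.reads = Counter()      # reads per merged file in the current window
        self.cache = {}             # extension -> cached merged blob
        self.backend_reads = 0      # reads that had to go to the backing store

    def read(self, name):
        ext = extension_of(name)
        offset, length = self.index[ext][name]   # first level: ext, second: name
        self.reads[ext] += 1
        if ext in self.cache:
            blob = self.cache[ext]
        else:
            self.backend_reads += 1
            blob = self.merged[ext]
        return blob[offset:offset + length]

    def refresh(self):
        """Close a statistics window: cache the hottest merged files, reset counts."""
        hottest = [ext for ext, _ in self.reads.most_common(self.capacity)]
        self.cache = {ext: self.merged[ext] for ext in hottest}
        self.reads.clear()
```

In this sketch, a caller would build the store once with `merge_by_extension`, serve reads through `HotCache.read`, and call `refresh` at the end of each statistics window. Reads that hit the cache never touch `merged`, which mirrors the abstract's point that cached files can be read without contacting the Namenode.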
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP333; TP311.13


Article ID: 1540494

