Research on File Storage Optimization Techniques in HDFS
Posted: 2018-06-24 15:04
Topic: Hadoop Distributed File System (HDFS) + storage node selection; Source: Master's thesis, Nanjing Normal University, 2013
[Abstract]: To cope with ever-growing volumes of data, the computing field has proposed a new paradigm: cloud computing. Hadoop is an open-source framework for large-scale distributed computing; its high throughput, high reliability, and high scalability have made it widely used in cloud computing. HDFS, the Hadoop Distributed File System, is designed to run on commodity hardware: it is highly fault-tolerant and can be deployed on inexpensive machines. HDFS provides high-throughput data access, is well suited to applications over large data sets, and supports streaming reads of file data.

As a distributed file system still under active development, however, HDFS inevitably has some shortcomings in file storage. For example, when storing data replicas, HDFS selects Datanodes on a rack at random, which can leave Datanode loads unbalanced and degrade overall system performance. Moreover, HDFS was originally designed for streaming access to large files and is not optimized for small files, so its performance on small files is poor. This thesis first briefly reviews the development of distributed file systems, then analyzes HDFS in depth, covering its architecture, metadata management, and file read/write workflows, and examines the performance and limitations of existing approaches to HDFS data storage and small-file storage. The main contributions of this thesis are as follows:

1. To address the load imbalance that can arise when Datanodes on a rack are chosen at random for replica storage, a multi-objective optimization technique is proposed that uses each Datanode's current running state to find the Datanode with the best overall conditions for storing data. This method balances replica storage across the Datanodes and can also improve read/write performance.

2. Real applications generate large numbers of small files. To address HDFS's weaknesses in storing them, strategies of small-file merging and client-side caching of small files are proposed. The client merges small files into a number of large files, then stores each large file together with its metadata in HDFS. When a small file is read, the client caches the entire large file returned by the Datanode; subsequent reads of that small file, or of other small files in the same large file, are served directly from the client. This reduces the frequency of the client's metadata requests to the Namenode and of its block requests to the Datanodes, greatly reducing small-file access time.
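The first contribution, selecting a Datanode by its current running state rather than at random, can be sketched as a weighted-sum scalarization of several load metrics. This is only an illustration of the idea: the metric names, weights, and the `DatanodeState` type below are hypothetical, not part of HDFS or of the thesis's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class DatanodeState:
    """Snapshot of a Datanode's current running state (illustrative metrics)."""
    name: str
    disk_used: float   # fraction of disk capacity in use, 0.0-1.0
    cpu_load: float    # normalized CPU load, 0.0-1.0
    net_load: float    # normalized network utilization, 0.0-1.0

def score(node: DatanodeState, weights=(0.5, 0.3, 0.2)) -> float:
    """Combine the objectives into one scalar cost; lower is better."""
    w_disk, w_cpu, w_net = weights
    return w_disk * node.disk_used + w_cpu * node.cpu_load + w_net * node.net_load

def choose_datanode(candidates):
    """Pick the candidate whose combined load score is lowest."""
    return min(candidates, key=score)

nodes = [
    DatanodeState("dn1", disk_used=0.90, cpu_load=0.40, net_load=0.10),
    DatanodeState("dn2", disk_used=0.30, cpu_load=0.20, net_load=0.50),
    DatanodeState("dn3", disk_used=0.50, cpu_load=0.80, net_load=0.20),
]
best = choose_datanode(nodes)
print(best.name)  # dn2: lowest combined cost despite higher network load
```

Steering replicas toward low-cost nodes this way keeps disk usage and load spread evenly, which is exactly the imbalance the random rack-local choice can produce.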
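The second contribution, merging small files and caching the containing large file on the client, can be sketched as follows. The class below is a minimal stand-in, assuming an in-memory dict in place of real HDFS I/O and a simple `(big file, offset, length)` index as the "related metadata"; none of these names come from HDFS's actual API.

```python
import io

class SmallFileClient:
    """Illustrative sketch of the merge-and-cache strategy."""

    def __init__(self):
        self.store = {}   # stands in for HDFS: big-file name -> bytes
        self.index = {}   # metadata: small name -> (big name, offset, length)
        self.cache = {}   # client-side cache of whole big files

    def merge_and_store(self, big_name, small_files):
        """Concatenate small files into one large file, recording offsets."""
        buf = io.BytesIO()
        for name, data in small_files.items():
            self.index[name] = (big_name, buf.tell(), len(data))
            buf.write(data)
        # One write to HDFS instead of one file (and Namenode entry) per small file.
        self.store[big_name] = buf.getvalue()

    def read(self, small_name):
        """Serve a small file, fetching its large file from HDFS at most once."""
        big_name, off, length = self.index[small_name]
        if big_name not in self.cache:      # only the first read hits the Datanode
            self.cache[big_name] = self.store[big_name]
        return self.cache[big_name][off:off + length]

c = SmallFileClient()
c.merge_and_store("big-0001", {"a.txt": b"alpha", "b.txt": b"beta"})
print(c.read("a.txt"))  # b'alpha'
print(c.read("b.txt"))  # b'beta', served from the cached large file
```

After the first read pulls `big-0001`, every other small file packed into it is answered locally, which is the source of the reduced Namenode and Datanode request counts the abstract describes.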
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master
[Year conferred]: 2013
[Classification numbers]: TP316.4; TP333