Research and Application of Distributed Storage Based on HDFS
Published: 2019-02-28 18:34
[Abstract]: The development of information technology has made large-scale data storage commonplace. Data must be preserved for long periods while its volume keeps growing, and traditional file systems cannot meet the storage-capacity, speed, and safety requirements of storing and processing such large amounts of data. Distributed file systems can store massive data and are therefore a key technology for large-scale data storage. In recent years, Hadoop has been widely adopted by major companies at home and abroad as a solution for storing and processing large-scale data. The Hadoop Distributed File System (HDFS), one of Hadoop's two core components, can serve as a large-scale data storage solution. This thesis studies distributed storage based on HDFS, covering small-file handling in an HDFS cluster, the replica placement policy and rack awareness, the NameNode backup and recovery mechanism, and the cluster's extension mechanism. Three approaches to the small-file problem are examined: Hadoop Archive, SequenceFile, and CombineFileInputFormat. The replica placement policy and rack awareness let the NameNode build a network topology of the DataNodes and choose replica locations according to the relationships among them, ensuring data reliability while preserving transfer efficiency. The NameNode backup and recovery mechanism protects NameNode metadata by periodically merging metadata backups into a new checkpoint; if the NameNode crashes, this shortens its restart time and can even recover lost data. HDFS's scalability lies in dynamically adding DataNodes, which meets the demands of large-scale data growth. Finally, applications are built on an HDFS cluster, and file-transfer efficiency is compared between the HDFS cluster and FTP, demonstrating the feasibility of HDFS as a large-scale data storage solution.
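The rack-aware replica placement described in the abstract can be illustrated with a minimal sketch. This is a simplified model of HDFS's default three-replica policy, not the real `BlockPlacementPolicyDefault` implementation; the names `DataNode` and `choose_targets` are hypothetical. With replication factor 3, the first replica goes on the writer's own DataNode, the second on a node in a different rack, and the third on a different node in the same rack as the second.

```python
# Simplified model of HDFS's default rack-aware replica placement.
# All names here (DataNode, choose_targets) are illustrative, not
# the actual Hadoop API.
import random
from collections import namedtuple

DataNode = namedtuple("DataNode", ["name", "rack"])

def choose_targets(nodes, writer, replication=3):
    """Pick DataNodes for one block's replicas, rack-aware."""
    targets = [writer]  # replica 1: local to the writer
    # Replica 2: any node on a different rack, for rack-failure tolerance.
    remote = [n for n in nodes if n.rack != writer.rack]
    if remote and replication >= 2:
        second = random.choice(remote)
        targets.append(second)
        # Replica 3: same rack as replica 2 but a different node,
        # trading a little reliability for cheaper intra-rack transfer.
        same_rack = [n for n in nodes
                     if n.rack == second.rack and n not in targets]
        if same_rack and replication >= 3:
            targets.append(random.choice(same_rack))
    return targets[:replication]

# Example: six DataNodes spread over two racks.
cluster = [DataNode(f"dn{i}", f"rack{i % 2}") for i in range(6)]
placement = choose_targets(cluster, writer=cluster[0])
```

This captures the trade-off the abstract mentions: two racks hold copies (reliability), but two of the three replicas share a rack (transfer efficiency).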
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2012
【Classification number】: TP333
Article ID: 2432056
【Citing literature】
Related master's theses (1):
1. Zhang Xing; Research and Implementation of a Cloud Storage Platform Based on Hadoop [D]; University of Electronic Science and Technology of China; 2013
Link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2432056.html