Research and Implementation of a Hadoop-Based Cloud Storage Platform
Published: 2018-09-04 11:51
[Abstract]: In recent years, cloud computing has increasingly become a focus of attention at home and abroad. When the core of a cloud computing system's processing is the storage of large volumes of data, the system evolves into a cloud storage system. With the rapid development of cloud computing, cloud storage has become one of the industry's most active research areas. As a new kind of service, cloud storage keeps users' data on cloud servers: a user only needs to log in to the cloud storage service over the Internet to access their data anywhere, at any time, without worrying about data loss.

Hadoop is an open-source distributed computing platform developed by Apache. Its excellent performance in distributed computing and data storage has drawn the attention of well-known IT enterprises at home and abroad, and major companies and research institutions have invested in it, so Hadoop is used ever more widely in cloud computing and cloud storage. HDFS, Hadoop's distributed file system, has strong data storage capabilities and is well suited to cloud storage systems. However, it has some design flaws and its performance is imperfect; it must be improved before large-scale deployment.

This thesis studies a cloud storage model based on HDFS, improves HDFS with respect to two problems, poor storage of small files and unbalanced replica distribution, and uses the improved HDFS to build a cloud storage platform. The main work is as follows:

1. To ensure the reliability of data storage, HDFS uses a replication mechanism to keep copies of each file in the cluster. File replicas are stored as data blocks on different DataNodes, but the default replica placement policy of HDFS is random and cannot guarantee that replicas are distributed evenly across the cluster. To solve this problem, this thesis proposes an algorithm that builds a weighted evaluation-index matrix and selects the node closest to the ideal solution and farthest from the worst solution, with the weights determined by the Analytic Hierarchy Process (AHP). Taking node load into account while emphasizing space utilization, the algorithm chooses the most suitable DataNode for each replica so that storage load is balanced across the DataNodes.

2. HDFS is designed for large files and is not suitable for storing large numbers of small files. For the same total volume of data, small files waste NameNode memory and reduce access efficiency. To address this, the thesis improves the HDFS file storage procedure: each file is checked before being uploaded to the HDFS cluster, small files are merged and optimized, and each small file's index information is saved in an index file as a key-value pair. The improved scheme reduces the NameNode memory consumed by large numbers of small files and raises access efficiency.

3. Extensive experiments compare the original HDFS with the improved scheme. The results show that the proposed improvements are effective and enhance HDFS performance. A storage cluster is built with the improved Hadoop, and a Web application is developed to simulate a cloud storage platform in B/S mode, implementing the main cloud storage functions.
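The node-selection step in point 1, choosing the candidate closest to the ideal solution and farthest from the worst one under AHP-derived weights, resembles the TOPSIS multi-criteria method. A minimal sketch of that idea, assuming three cost metrics per node (space usage, CPU load, network load) and externally supplied weights; the metric names and weight values here are hypothetical, not taken from the thesis:

```python
import math

def select_datanode(nodes, weights):
    """Pick the DataNode whose weighted metrics are closest to the
    ideal (best) solution and farthest from the worst, TOPSIS-style.
    `nodes` maps node name -> [space_usage, cpu_load, net_load],
    where lower is better for every metric; `weights` are the
    AHP-derived importance weights."""
    names = list(nodes)
    cols = list(zip(*nodes.values()))
    # Vector-normalize each metric column, then apply its weight.
    norms = [math.sqrt(sum(v * v for v in col)) or 1.0 for col in cols]
    matrix = {
        n: [w * v / norm for v, w, norm in zip(nodes[n], weights, norms)]
        for n in names
    }
    # All metrics are costs: ideal = column minima, worst = column maxima.
    best = [min(col) for col in zip(*matrix.values())]
    worst = [max(col) for col in zip(*matrix.values())]

    def closeness(row):
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the worst solution
        return d_worst / ((d_best + d_worst) or 1.0)

    # Highest closeness = nearest to best, farthest from worst.
    return max(names, key=lambda n: closeness(matrix[n]))
```

With weights emphasizing space utilization (as the thesis does), a lightly loaded node with low disk usage would be preferred over busier ones.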
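The merge-and-index scheme in point 2 can be illustrated with a toy in-memory sketch: small files are appended to one large container, and an index maps each original name to a key-value pair of (offset, length). The class name, the threshold value, and the in-memory container are assumptions for illustration; the thesis's actual scheme operates on HDFS files:

```python
import io

SMALL_FILE_THRESHOLD = 4 * 1024 * 1024  # hypothetical cutoff for "small"

class SmallFileMerger:
    """Toy sketch of the merge-before-upload idea: many small files
    become one large blob plus a compact index, so the NameNode tracks
    one file instead of thousands."""

    def __init__(self):
        self.container = io.BytesIO()   # stands in for one big HDFS file
        self.index = {}                 # filename -> (offset, length)

    def add(self, name, data):
        # Only files under the threshold are merged; large files would
        # be uploaded to HDFS directly in the real scheme.
        assert len(data) < SMALL_FILE_THRESHOLD, "not a small file"
        offset = self.container.tell()
        self.container.write(data)
        self.index[name] = (offset, len(data))

    def read(self, name):
        # Random access via the index: seek to offset, read length bytes.
        offset, length = self.index[name]
        return self.container.getvalue()[offset:offset + length]
```

Reading a merged file costs one index lookup plus one ranged read, which is how the scheme recovers access efficiency while shrinking NameNode metadata.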
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master
[Year conferred]: 2013
[Classification number]: TP333