Research and Implementation of a Cloud Storage Platform Based on Hadoop
Published: 2018-09-04 11:51
[Abstract]: In recent years, cloud computing has drawn growing attention in China and abroad. When the core work of a cloud computing system is storing large volumes of data, the system becomes in effect a cloud storage system, and the rapid development of cloud computing has made cloud storage one of the most active research areas in the industry. As a new kind of service, cloud storage keeps users' data on servers in the cloud: a user who logs in to the service over the Internet can access that data from anywhere at any time, without worrying about losing it.

Hadoop is an open-source distributed computing platform developed by Apache. Its strong performance in distributed computing and data storage has attracted well-known IT companies in China and abroad, and many enterprises and research institutions have invested in studying it, so Hadoop is increasingly widely used in cloud computing and cloud storage. HDFS, Hadoop's distributed file system, has powerful data storage capabilities and is well suited to cloud storage systems. Its design has some shortcomings, however, and its performance is not ideal, so it needs to be improved before it can be deployed at large scale.

This thesis studies a cloud storage model based on HDFS, improves HDFS with respect to two problems (poor small-file storage and unbalanced replica distribution), and uses the improved HDFS to build a cloud storage platform. The main work is as follows:

1. To ensure reliable data storage, HDFS uses a replica mechanism that stores copies of each file in the cluster. Replicas are stored as data blocks on different DataNodes, but the default HDFS replica placement policy is random in nature and cannot guarantee that replicas are evenly distributed across the cluster. To address this, the thesis proposes an algorithm that builds a weighted evaluation-index matrix and selects the node closest to the best solution and farthest from the worst solution, with the weights determined by the analytic hierarchy process (AHP). While taking node load into account, the algorithm places the emphasis on space utilization and selects the most suitable DataNode for each replica, so that the space load across DataNodes is balanced overall.

2. HDFS is designed for large files and is not suited to storing large numbers of small files. For the same amount of data, small files waste NameNode memory and reduce access efficiency. To address this, the thesis modifies the HDFS file storage procedure: each file is checked before it is uploaded to the HDFS cluster, small files are merged, and the index information for each small file is saved in an index file as key-value pairs. The modified scheme reduces the NameNode memory consumed by large numbers of small files and improves access efficiency.

3. Extensive experiments compare the original HDFS with the improved scheme. The results show that the improvements proposed in the thesis perform better and improve HDFS performance. A storage cluster is built with the improved Hadoop, and a Web application is developed to simulate a cloud storage platform in B/S (browser/server) mode and implement the related cloud storage functions.
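The replica placement idea in item 1 (pick the node closest to the best solution and farthest from the worst, with AHP-derived weights) resembles the TOPSIS ranking method. The abstract does not give the thesis's actual criteria, comparison matrix, or normalization, so the following is only a minimal sketch under assumptions: two cost criteria (space utilization and load, lower is better), a hypothetical pairwise comparison matrix, and the row geometric-mean approximation for AHP weights. The function names and sample values are illustrative and do not come from the thesis.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def topsis_scores(matrix, weights):
    """Closeness score per candidate; every criterion is a cost (lower is better)."""
    n_crit = len(weights)
    # Vector-normalise each column (guarding against an all-zero column), then weight it.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # For cost criteria the best value is the column minimum, the worst the maximum.
    best = [min(row[j] for row in weighted) for j in range(n_crit)]
    worst = [max(row[j] for row in weighted) for j in range(n_crit)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        denom = d_best + d_worst
        # If all candidates are identical, treat each one as equally good.
        scores.append(d_worst / denom if denom else 1.0)
    return scores

def choose_datanode(candidates, weights):
    """Return the name of the candidate with the highest closeness score."""
    names = list(candidates)
    scores = topsis_scores([candidates[name] for name in names], weights)
    return max(zip(scores, names))[1]

# Hypothetical judgement: space utilization is 3x as important as load.
PAIRWISE = [[1.0, 3.0],
            [1.0 / 3.0, 1.0]]

# Hypothetical candidates: (space utilization, load), both in [0, 1].
CANDIDATES = {
    "dn1": (0.80, 0.30),
    "dn2": (0.40, 0.50),
    "dn3": (0.45, 0.90),
}

if __name__ == "__main__":
    weights = ahp_weights(PAIRWISE)
    print(weights, choose_datanode(CANDIDATES, weights))
```

Because space utilization carries most of the weight in this made-up example, a lightly filled node can win even when its load is not the lowest, which matches the emphasis the abstract describes.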
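The small-file handling in item 2 (check each file before upload, merge the small ones, and record their index information as key-value pairs) can be illustrated with an in-memory sketch. The abstract does not state the size threshold, the merged-file layout, or the index format, so everything here is assumed: the 1 MiB cutoff, the (offset, length) index entries, and the function names are hypothetical, and the code works on plain byte strings rather than calling any HDFS API.

```python
from typing import Dict, Tuple

# Hypothetical cutoff; the source does not say what counts as a "small" file.
SMALL_FILE_THRESHOLD = 1024 * 1024  # 1 MiB

Index = Dict[str, Tuple[int, int]]  # file name -> (offset, length) in the merged blob

def pack_small_files(files: Dict[str, bytes],
                     threshold: int = SMALL_FILE_THRESHOLD):
    """Merge files below the threshold into one blob and build a key-value index.

    Returns (merged_blob, index, large_files). Files at or above the threshold
    are passed through untouched so they can be uploaded individually.
    """
    merged = bytearray()
    index: Index = {}
    large_files: Dict[str, bytes] = {}
    for name, data in files.items():
        if len(data) >= threshold:
            large_files[name] = data
            continue
        # Record where this file starts and how long it is, then append it.
        index[name] = (len(merged), len(data))
        merged.extend(data)
    return bytes(merged), index, large_files

def read_small_file(merged: bytes, index: Index, name: str) -> bytes:
    """Look up a small file through the index and slice it out of the blob."""
    offset, length = index[name]
    return merged[offset:offset + length]

if __name__ == "__main__":
    sample = {"a.txt": b"hello", "b.txt": b"world!", "big.bin": b"x" * 10}
    blob, idx, large = pack_small_files(sample, threshold=8)
    print(idx, sorted(large), read_small_file(blob, idx, "b.txt"))
```

In a layout like this, the file system tracks one merged file and one index file instead of one entry per small file, which is the kind of NameNode memory saving the abstract points to.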
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333
Article ID: 2222021
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2222021.html