Research on Replica Strategies for the HDFS Distributed Parallel File System
Published: 2018-08-02 12:10
[Abstract]: In recent years, science and technology have continued to advance and the volume of global data has grown rapidly. In particular, the emergence of Web 2.0, with its emphasis on user interaction, has changed users from mere readers of the Internet into creators of its content. In this environment of massive information, traditional storage systems can no longer keep pace with the rapid growth of data; they face capacity and performance bottlenecks imposed by limits such as the number of hard disks and the number of servers.
HDFS (Hadoop Distributed File System) is a new kind of distributed file system that differs from traditional distributed parallel file systems: it runs on inexpensive commodity machines and offers high throughput, high fault tolerance, and high reliability. It provides distributed data storage and management together with high-performance data access and interaction.
Within the distributed parallel file system HDFS, replicas are a core component. Replica technology coordinates the resources of the nodes across the network so that demanding workloads can be completed efficiently, and it does so by improving the effective transfer of data among nodes through replica placement, replica selection, and replica adjustment.
This thesis first reviews the state of research on replica management strategies, summarizing prior results in the field and their limitations. On that basis it analyzes and explains key HDFS technologies, including the system architecture and the read/write mechanisms, and builds a dynamic replica management model for HDFS that is developed along two lines: replica placement and replica deletion. Following the ideas for improving replica placement, an algorithm is designed and a placement strategy based on distance and load information is proposed, in which a balance factor adjusts the relative weight of distance and load so that the requirements different users place on the system can be met. At the same time, to satisfy the needs of the replica adjustment phase, the replica deletion strategy is improved by introducing a replica evaluation function, giving a deletion strategy based on value assessment. Finally, simulation experiments verify the effectiveness of the proposed replica strategies and compare them against the default HDFS replica strategies.
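To make the placement idea concrete, below is a minimal sketch of how candidate DataNodes could be ranked by a utility value that combines transfer cost (replica size over available bandwidth, standing in for distance) with node load, weighted by a balance factor α. The linear form of the utility, the normalization, and the names used here are illustrative assumptions; the abstract does not give the exact formula.

```python
from dataclasses import dataclass

@dataclass
class DataNode:
    name: str
    bandwidth_mbps: float   # available bandwidth from the writer to this node
    load: float             # current load, normalized to [0, 1]

def node_utility(node: DataNode, replica_size_mb: float, alpha: float) -> float:
    """Hypothetical utility: alpha weights transfer cost against node load.

    Both terms are normalized so that a larger utility means a better target.
    alpha close to 1 favors short transfer time (distance/bandwidth);
    alpha close to 0 favors lightly loaded nodes.
    """
    transfer_time = replica_size_mb / max(node.bandwidth_mbps, 1e-6)
    transfer_score = 1.0 / (1.0 + transfer_time)    # higher is better
    load_score = 1.0 - node.load                     # higher is better
    return alpha * transfer_score + (1.0 - alpha) * load_score

def choose_targets(candidates, replica_size_mb, replication=3, alpha=0.5):
    """Pick the `replication` nodes with the largest utility values."""
    ranked = sorted(candidates,
                    key=lambda n: node_utility(n, replica_size_mb, alpha),
                    reverse=True)
    return ranked[:replication]

# Example: rank candidate DataNodes for a 128 MB block.
nodes = [DataNode("dn1", bandwidth_mbps=800, load=0.9),
         DataNode("dn2", bandwidth_mbps=200, load=0.2),
         DataNode("dn3", bandwidth_mbps=600, load=0.4)]
print([n.name for n in choose_targets(nodes, 128, replication=2, alpha=0.5)])
```

Tuning α toward 1 would favor nodes that can receive the block quickly, while tuning it toward 0 would spread load more evenly; this is one plausible reading of how a balance factor lets different users trade response time against load balancing.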
The main contributions of this thesis are as follows:
1) The differences between the HDFS distributed parallel file system and traditional distributed systems are analyzed, with an emphasis on a comparison with GFS: the design ideas and principles of the two systems are examined and the similarities and differences of their replica management strategies are compared, showing that HDFS is a simplified design derived from GFS and offers more flexible operation.
2) A replica placement strategy based on distance and load information is proposed. The strategy replaces the random storage algorithm of the default HDFS placement policy: it takes three factors into account, namely replica size, transmission bandwidth, and node load, computes a utility value for each node, and preferentially stores data blocks on the nodes with the largest utility. A balance factor is introduced so that the weighting can be tuned to the performance requirements of different users. Simulation experiments show that the proposed algorithm clearly outperforms the default HDFS placement strategy in terms of load balancing.
3) A replica deletion strategy based on value assessment is proposed. When a new replica write request arrives, the NameNode randomly obtains a set of DataNodes and selects one of them to write the data; if the selected node already holds too many replicas and is too heavily loaded, its performance cannot be used effectively. The default HDFS replica adjustment strategy does not take this into account. The improved strategy computes the value of each replica with an evaluation function and sorts the replicas; when a node becomes overloaded, the replica with the smallest value is deleted, freeing space on the node and making full use of its capacity (a sketch of this idea follows this list). Experiments show that in large-file write tests the proposed strategy achieves higher performance than the default HDFS strategy.
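Below is a minimal sketch of the value-evaluation idea behind the deletion strategy. For illustration it assumes that a replica's value grows with its recent access count, its size, and its scarcity (how few other copies of the same block exist), and that a replica is never removed if it is the last copy; the actual evaluation function of the thesis is not specified in this abstract, so the formula, fields, and watermark threshold below are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    block_id: str
    size_mb: float
    recent_accesses: int    # accesses within some recent window
    copies_elsewhere: int   # how many other replicas of this block exist

def replica_value(r: Replica) -> float:
    """Hypothetical value function: frequently accessed, large, or scarce
    replicas are considered more valuable and are kept longer."""
    scarcity = 1.0 / (1.0 + r.copies_elsewhere)
    return (1.0 + r.recent_accesses) * r.size_mb * scarcity

def free_space(replicas, node_used_mb, capacity_mb, high_watermark=0.9):
    """Delete lowest-value replicas until usage drops below the watermark.
    Replicas that have no other copy are never removed here."""
    victims = []
    candidates = sorted((r for r in replicas if r.copies_elsewhere > 0),
                        key=replica_value)
    for r in candidates:
        if node_used_mb <= high_watermark * capacity_mb:
            break
        victims.append(r)
        node_used_mb -= r.size_mb
    return victims   # replicas the overloaded node would delete

# Example: a node above its watermark evicts the least valuable replica first.
replicas = [Replica("b1", 128, recent_accesses=40, copies_elsewhere=2),
            Replica("b2", 128, recent_accesses=1,  copies_elsewhere=2),
            Replica("b3", 64,  recent_accesses=0,  copies_elsewhere=1)]
print([r.block_id for r in free_space(replicas, node_used_mb=950, capacity_mb=1000)])
```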
Degree-granting institution: Zhejiang Normal University
Degree level: Master's
Year conferred: 2013
CLC number: TP316.4; TP333
Article No.: 2159391
Link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2159391.html