Research on Replica Strategies for the HDFS Distributed Parallel File System
[Abstract]: In recent years, with the continued development of science and technology, the volume of global data has grown rapidly. In particular, the emergence of Web 2.0, which emphasizes user interaction, has changed users from mere readers of the Internet into creators of its content. In such a massive-information environment, traditional storage systems can no longer keep pace with this growth in data; they face capacity and performance bottlenecks imposed by limits such as the number of hard disks and the number of servers.
HDFS (Hadoop Distributed File System) is a new distributed file system that differs from traditional distributed parallel file systems: it runs on inexpensive commodity machines while offering high throughput, high fault tolerance, and high reliability. It provides distributed data storage and management, along with high-performance data access and interaction.
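As a point of reference for the data access interface mentioned above (this example is ours, not the thesis's; the path and payload are invented for illustration), a client reads and writes HDFS files through Hadoop's standard FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAccessExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);        // replicas per block; 3 is the HDFS default
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/demo/sample.txt"); // illustrative path
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello HDFS");           // the NameNode chooses DataNodes for each replica
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());     // the client is directed to a nearby replica
            }
            fs.close();
        }
    }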
In the distributed parallel file system HDFS, replicas are a core component. Replication coordinates the resources of individual nodes across the network to complete demanding, work-intensive tasks efficiently; it does so through replica placement, replica selection, and replica adjustment, which together improve the effectiveness of data transfer among nodes.
This paper first analyzes the state of research on replica management strategies and summarizes the existing results in this field along with their limitations. On this basis, it analyzes and explains key HDFS techniques such as the system architecture and the read/write mechanisms, and then builds a dynamic replica management model for HDFS, developed along two lines: replica placement and replica deletion. Next, following this idea for improving the placement strategy, an algorithm is designed: a replica placement strategy based on distance and load information, in which a balance factor adjusts the relative weight of distance and load to meet the requirements that different users place on the system. At the same time, to meet the needs of the replica adjustment phase, the replica deletion strategy is improved by introducing a replica evaluation function, yielding a deletion strategy based on value assessment. Finally, simulation experiments verify the effectiveness of the proposed replica strategies and compare them with the default HDFS replica strategy.
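The abstract does not state the exact form of the utility function; as an illustrative sketch only (the symbols U, \hat{D}, \hat{L}, and \alpha are our notation, not the thesis's), a balance factor \alpha weighting a normalized distance score \hat{D}(n) against a normalized load \hat{L}(n) for a candidate node n could take the form

    U(n) = \alpha \, \hat{D}(n) + (1 - \alpha) \, \bigl(1 - \hat{L}(n)\bigr), \qquad \alpha \in [0, 1],

with the data block placed on the node of highest U(n); \alpha near 1 favors nearby nodes, while \alpha near 0 favors lightly loaded ones.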
The main contributions of this paper are as follows:
1) The differences between the HDFS distributed parallel file system and traditional distributed systems are analyzed, with emphasis on a comparison with GFS: the design ideas and principles of the two systems are examined and the similarities and differences of their replica management strategies are compared, showing that HDFS is a simplified design of GFS with more flexible operation.
2) A replica placement strategy based on distance and load information is proposed. The strategy replaces the random choice in the default HDFS placement strategy: it jointly considers three factors (replica size, transmission bandwidth, and node load) to compute a utility value for each node, preferentially stores data blocks on nodes with higher utility, and introduces a balance factor so that different users' performance requirements can be met (see the placement sketch after this list). Simulation experiments confirm that the algorithm clearly outperforms the default HDFS placement strategy in load balancing.
3) A replica deletion strategy based on value assessment is proposed. When a new replica write request arrives, the Namenode randomly obtains a set of Datanodes and selects one node to write the data. If the selected node already holds too many replicas and is too heavily loaded, its performance cannot be used effectively; the default HDFS replica adjustment strategy does not take this into account. The improved strategy computes each replica's value with an evaluation function and sorts the replicas; when a node becomes overloaded, the replica with the lowest value is deleted, freeing space on the node and making full use of it (see the deletion sketch after this list). Experiments show that, in large-file write tests, this strategy achieves higher performance than the default HDFS strategy.
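To make contribution 2 concrete, the placement step might be sketched as follows in Java (Hadoop's implementation language). This is a minimal sketch under our own assumptions: the utility combines a transfer score derived from replica size and bandwidth with a load score via the balance factor alpha, and all class and field names are illustrative rather than taken from the thesis.

    import java.util.Comparator;
    import java.util.List;

    // Sketch of distance/load-aware replica placement (illustrative, not the thesis code).
    public class UtilityPlacement {
        static class Node {
            final String id;
            final double bandwidthMBps; // bandwidth between writer and node (distance proxy)
            final double loadRatio;     // current load, normalized to [0, 1]
            Node(String id, double bandwidthMBps, double loadRatio) {
                this.id = id;
                this.bandwidthMBps = bandwidthMBps;
                this.loadRatio = loadRatio;
            }
        }

        // Utility of storing a replica of the given size on node n:
        // alpha near 1 favors fast transfer, alpha near 0 favors light load.
        static double utility(Node n, double replicaSizeMB, double alpha) {
            double transferScore = 1.0 / (1.0 + replicaSizeMB / n.bandwidthMBps); // in (0, 1]
            double loadScore = 1.0 - n.loadRatio;                                 // in [0, 1]
            return alpha * transferScore + (1.0 - alpha) * loadScore;
        }

        // Choose the candidate with the highest utility, as the strategy prescribes.
        static Node choose(List<Node> candidates, double replicaSizeMB, double alpha) {
            return candidates.stream()
                    .max(Comparator.comparingDouble(n -> utility(n, replicaSizeMB, alpha)))
                    .orElseThrow(() -> new IllegalArgumentException("no candidate nodes"));
        }
    }

A writer would call choose(...) once per replica, excluding nodes that already hold a copy of the block.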
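Similarly, for contribution 3 the value-based deletion step could look like the sketch below. The abstract only states that a value evaluation function exists; the specific function used here (access frequency weighted by recency) is our assumption for illustration.

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    // Sketch of value-based replica eviction (illustrative, not the thesis code).
    public class ValueBasedEviction {
        static class Replica {
            final String blockId;
            final long sizeBytes;
            final long accessCount;      // how often this replica has been read
            final long lastAccessMillis; // most recent access time
            Replica(String blockId, long sizeBytes, long accessCount, long lastAccessMillis) {
                this.blockId = blockId;
                this.sizeBytes = sizeBytes;
                this.accessCount = accessCount;
                this.lastAccessMillis = lastAccessMillis;
            }
        }

        // Assumed value function: frequently and recently accessed replicas are worth more.
        static double value(Replica r, long nowMillis) {
            double ageSeconds = (nowMillis - r.lastAccessMillis) / 1000.0;
            return r.accessCount / (1.0 + ageSeconds);
        }

        // When a node is overloaded, delete lowest-value replicas until usage
        // falls back below the given fraction of capacity.
        static void evictWhileOverloaded(List<Replica> replicas, long usedBytes,
                                         long capacityBytes, double maxUsageFraction) {
            long now = System.currentTimeMillis();
            replicas.sort(Comparator.comparingDouble(r -> value(r, now)));
            Iterator<Replica> it = replicas.iterator();
            while (it.hasNext() && usedBytes > maxUsageFraction * capacityBytes) {
                usedBytes -= it.next().sizeBytes; // drop the least valuable replica first
                it.remove();
            }
        }
    }

In a real Namenode-side implementation, deleted replicas would also have to be re-created elsewhere if deletion drops a block below its target replication factor; the sketch omits that bookkeeping.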
【Degree-granting institution】: Zhejiang Normal University
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP316.4;TP333
Article ID: 2159391
Link to this article: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2159391.html