Research on Methods for Building Efficient Data Deduplication Systems for Data Backup
Published: 2018-05-19 21:42
Topic: data deduplication + chunk fragmentation; Source: doctoral dissertation, Huazhong University of Science and Technology, 2016
【Abstract】: In the era of big data, storing and managing massive amounts of data efficiently has become a major challenge for storage-system researchers and practitioners. Numerous studies have shown that redundant data is pervasive in all kinds of storage systems, such as backup storage systems and desktop file systems. Eliminating this redundancy can save a large amount of storage cost. Against this background, data deduplication, an efficient compression technique, has gradually been adopted in a wide range of storage systems. However, building an efficient deduplication system still faces many problems and challenges, such as chunk fragmentation, large-scale fingerprint indexing, and storage reliability. This dissertation first addresses the chunk fragmentation caused by deduplication in backup storage systems, then studies the impact of deduplication on storage reliability, and finally discusses systematically how to design an efficient deduplication system for backup workloads.

The chunk fragmentation caused by deduplication severely degrades the restore performance of backup data streams and, after users delete backups, lowers the efficiency of garbage collection. An analysis of long-term backup datasets shows that chunk fragmentation mainly comes from two types of containers: sparse containers and out-of-order containers. Sparse containers directly amplify read operations, whereas out-of-order containers hurt restore performance only when the restore cache is too small, so the two require different solutions. Existing buffer-based rewriting algorithms cannot accurately distinguish sparse containers from out-of-order containers, which leads to poor storage efficiency and restore performance. This dissertation therefore proposes a history-aware approach to chunk fragmentation, consisting of the history-aware rewriting algorithm HAR, the optimal restore cache algorithm OPT, the cache-aware filter CAF, and the container-marking algorithm CMA. The approach rewrites sparse containers and relies on the restore cache to absorb the impact of out-of-order containers, thereby reducing storage overhead. HAR exploits the similarity between consecutive backup data streams to identify sparse containers accurately, which is the key to separating sparse containers from out-of-order ones. OPT records the chunk access order at backup time and uses it to implement Belady's optimal caching algorithm, reducing the impact of out-of-order containers on restore performance. To further reduce the demand on the restore cache, CAF simulates the restore cache in order to identify and rewrite only the small number of out-of-order containers that actually degrade restore performance. To reduce the time and space overhead of garbage collection, CMA uses HAR to clean up sparse containers; under a first-in-first-out backup deletion policy, CMA reclaims a large amount of storage space without time-consuming container merging, and because it tracks container utilization directly, its overhead is proportional to the number of containers rather than the number of chunks. Experiments on four long-term backup datasets show that the history-aware approach achieves lower storage cost and better restore performance than existing algorithms.
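To make the rewriting decision concrete, the following is a minimal sketch of how a HAR-style backup pass might work, assuming a simplified chunk/container model; the container size, the utilization threshold, and all identifiers are illustrative choices rather than the dissertation's implementation.

```python
from collections import defaultdict

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed fixed container size (4 MiB), illustrative
UTILIZATION_THRESHOLD = 0.5        # containers referenced below 50% count as sparse

class HistoryAwareRewriter:
    """Hypothetical sketch of HAR-style rewriting, not the dissertation's code."""

    def __init__(self):
        # Sparse containers identified during the previous backup of this stream.
        self.sparse_containers = set()

    def backup(self, chunk_stream, fingerprint_index):
        """chunk_stream: iterable of (fingerprint, size_in_bytes).
        fingerprint_index: dict mapping fingerprint -> container id of the
        already-stored copy (absent for new chunks)."""
        referenced_bytes = defaultdict(int)
        written, deduplicated = [], []

        for fp, size in chunk_stream:
            container = fingerprint_index.get(fp)
            if container is None:
                # New chunk: write it into an open container as usual.
                written.append((fp, size))
            elif container in self.sparse_containers:
                # Duplicate whose stored copy lives in a container the previous
                # backup already found sparse: rewrite it to defragment restores.
                written.append((fp, size))
            else:
                # Ordinary duplicate: deduplicate it and account for the bytes
                # it references in its container.
                deduplicated.append((fp, size))
                referenced_bytes[container] += size

        # Historical information for the next backup: consecutive backups are
        # highly similar, so a container that is sparse for this backup is very
        # likely to be sparse for the next one as well.
        self.sparse_containers = {
            cid for cid, used in referenced_bytes.items()
            if used / CONTAINER_SIZE < UTILIZATION_THRESHOLD
        }
        return written, deduplicated
```

The point the sketch tries to capture is that the rewrite decision is driven by the previous backup's container utilization rather than by a small look-ahead buffer, which is what allows sparse containers to be told apart from out-of-order ones.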
The impact of deduplication on storage reliability has long been an open question. By eliminating redundant data, deduplication reduces the number of disks required and therefore the probability of encountering disk errors; at the same time, it increases the severity of each disk error, because losing a single chunk may corrupt multiple files. This dissertation proposes a quantitative method for analyzing the reliability of deduplication systems: it introduces the notion of the amount of logical data lost to extend the existing reliability metric NOMDL so that the metric can measure the reliability of deduplication systems, and it presents SIMD, a reliability simulator designed for deduplication systems. SIMD uses failure statistics published by industry to simulate sector errors and whole-disk failures and to generate the various data-loss events of a disk array. To compute the amount of logical data lost in each event, SIMD builds block-level and file-level models from real file-system images. Analysis and simulation of 18 real file-system images show that, thanks to redundancy within files, deduplication can significantly reduce the number of files corrupted by sector errors; however, the chunk fragmentation introduced by deduplication increases the damage caused by whole-disk failures. To improve storage reliability, the dissertation proposes the DCT replication technique. DCT allocates 1% of the disk array's physical space to replicas of highly referenced chunks and repairs these replicas first when the array is rebuilt. With this small storage overhead, DCT reduces the chunks and files lost to whole-disk failures by 31.8% and 25.8%, respectively.

Building an efficient deduplication system also requires considering the impact of other modules, such as the fingerprint index. To systematically understand and compare existing designs, and to propose new and more efficient ones, the dissertation designs and implements Destor, a general-purpose deduplication prototype. Destor views a deduplication system as a multi-dimensional parameter space in which each dimension represents one sub-module or parameter of the system, including chunking, fingerprint indexing, the rewriting algorithm, the restore algorithm, and so on. Each parameter has several candidate designs, and both existing designs and potential new ones are regarded as points in this parameter space. Destor implements the parameter space and covers the designs of many mainstream deduplication systems, so researchers can use it to compare existing designs and to explore the space for promising new ones. To find more efficient designs, the parameter space was explored with three long-term backup datasets, focusing on four metrics: memory overhead, storage cost, backup performance, and restore performance. A target design must sustain stable, high backup performance over the long term and strike a reasonable trade-off among the other three metrics. Seventeen experimental findings were obtained, and the designs that satisfy these requirements are summarized as follows: when the lowest storage cost is required, exact deduplication exploiting logical locality should be used; when the lowest memory overhead is required, near-exact deduplication exploiting either logical or physical locality can be used; when stable, high restore performance is required, exact deduplication exploiting physical locality should be used together with the history-aware fragmentation-handling approach. When higher reliability is required, any of these designs can additionally adopt the DCT replication technique, which adds only minimal storage cost without affecting backup or restore performance.
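As an illustration of the DCT idea, the sketch below fills a reserved replica region, assumed here to be 1% of the array's physical capacity, with the most highly referenced chunks; during an array rebuild these replicas would be repaired first. The function name and data layout are hypothetical.

```python
def select_replicated_chunks(chunks, physical_capacity_bytes, replica_fraction=0.01):
    """chunks: iterable of (chunk_id, size_bytes, reference_count).
    Returns the ids of the chunks whose extra copies fit into the reserved
    replica region (replica_fraction of the array's physical capacity)."""
    budget = physical_capacity_bytes * replica_fraction
    selected, used = [], 0
    # Most referenced first: losing such a chunk corrupts the largest amount of
    # logical (pre-deduplication) data, so it benefits most from an extra copy.
    for chunk_id, size, refs in sorted(chunks, key=lambda c: c[2], reverse=True):
        if used + size > budget:
            continue   # this chunk no longer fits; smaller ones may still fit
        selected.append(chunk_id)
        used += size
    return selected
```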
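Finally, to complement the fragmentation part of the abstract, here is a minimal sketch of a Belady-style restore cache of the kind OPT describes. Because the container access sequence is recorded at backup time, the restore procedure knows all future accesses and can evict the cached container whose next use lies farthest ahead; container reads are counted as a simple proxy for restore cost, and all names are illustrative assumptions.

```python
def simulate_optimal_restore(access_sequence, cache_slots):
    """access_sequence: container ids in the order the restore will need them,
    known in advance because it was recorded at backup time.
    cache_slots: number of containers that fit in the restore cache (>= 1).
    Returns the number of container reads (lower is better)."""
    # Pre-compute, for every position, when the same container is needed next.
    next_use = [0] * len(access_sequence)
    last_seen = {}
    for i in range(len(access_sequence) - 1, -1, -1):
        cid = access_sequence[i]
        next_use[i] = last_seen.get(cid, float("inf"))
        last_seen[cid] = i

    cache = {}              # container id -> position of its next use
    container_reads = 0
    for i, cid in enumerate(access_sequence):
        if cid not in cache:
            container_reads += 1          # cache miss: read the container from disk
            if len(cache) >= cache_slots:
                # Belady's rule: evict the container needed farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[cid] = next_use[i]          # refresh the next-use position on every access
    return container_reads
```

For example, with two cache slots the sequence ["c1", "c2", "c1", "c3", "c2"] needs three container reads under this policy, while an LRU cache of the same size needs four.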
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Doctoral
【Year conferred】: 2016
【CLC number】: TP333
Article ID: 1911829
Link: http://sikaile.net/shoufeilunwen/xxkjbs/1911829.html