
Sparse Indexing Techniques for File-Level Data Deduplication

Published: 2018-03-01 03:32

  Keywords: data deduplication; sparse indexing; virtual machine image files; file-duplication locality. Source: National University of Defense Technology, 2012 master's thesis. Type: degree thesis.


【Abstract】: Data deduplication has been a research hotspot in the storage field in recent years. It excels at improving storage utilization and reducing the bandwidth consumed by data transfer, and is now widely applied in data backup, archival storage, and remote disaster-recovery systems. Most large-scale data centers hold large volumes of duplicate data, greatly wasting storage resources and energy. Cloud-computing data centers in particular store many virtual machine image files, which contain substantial duplication; removing it not only saves disk space but also reduces the bandwidth consumed when transferring the images and speeds up their access and distribution.

The disk-access bottleneck common to deduplication systems constrains their performance. Existing remedies include Data Domain's approach based on Bloom filters, stream-informed segment layout (SISL), and locality preservation, as well as sparse indexing and Extreme Binning. These schemes exploit the locality of data access to shrink the in-memory index and cut the number of disk I/Os per index lookup. However, they do not carry over to deduplication at the coarser granularity of whole files, and none of them resolves the disk-access bottleneck in that setting.

This thesis first addresses the disk-access bottleneck in file-level deduplication of the many virtual machine image files found in cloud computing, proposing a deduplication method based on random sampling. The in-memory index holds not the full set of file indexes but a randomly drawn sample of them, reducing the total number of indexes kept in memory. During duplicate detection, the locality of duplication in virtual machine image files is exploited: the in-memory hits of the sampled file indexes are used to infer the hit status of the other file indexes in the same directory, so that checking each file in the directory does not require repeated disk accesses. Implementing and comparing the baseline and the random-sampling deduplication algorithms shows that, when the full index table cannot fit in memory, random sampling shrinks the in-memory sparse index to 1/10 of the original index count while still achieving a substantial deduplication ratio; it greatly reduces disk accesses and eliminates the sharp drop in detection performance that otherwise occurs when the index table exceeds memory, improving the performance of the deduplication system.
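The random-sampling idea above can be illustrated with a small sketch. The class below is a hypothetical, simplified rendering of the described scheme, not the thesis's actual implementation: only a roughly 1/`sample_rate` fraction of file hashes is kept in memory, and a hit on any sampled file from a directory pulls that directory's full hash list from (simulated) on-disk storage, so the remaining files in the directory are checked without further index lookups.

```python
import hashlib
import random


class SampledDedupIndex:
    """Sketch of a random-sampling sparse index (hypothetical API).

    Only a sampled subset of file hashes lives in memory; duplication
    locality lets one sampled hit resolve a whole directory's files
    with a single read of the on-disk index for that directory.
    """

    def __init__(self, sample_rate=10, seed=42):
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        self.sparse = {}    # in-memory: sampled file hash -> directory id
        self.on_disk = {}   # stand-in for the on-disk full index, keyed by directory

    def add_directory(self, dir_id, file_hashes):
        """Store a directory's full file-hash set and sample some hashes into memory."""
        self.on_disk[dir_id] = set(file_hashes)
        for h in file_hashes:
            if self.rng.randrange(self.sample_rate) == 0:
                self.sparse[h] = dir_id

    def dedup_directory(self, file_hashes):
        """Return the subset of file_hashes detected as duplicates."""
        # In-memory lookups only: which known directories did we hit?
        hit_dirs = {self.sparse[h] for h in file_hashes if h in self.sparse}
        known = set()
        for d in hit_dirs:          # one "disk" read per matching directory
            known |= self.on_disk[d]
        return {h for h in file_hashes if h in known}


def file_hash(data: bytes) -> str:
    """Content fingerprint for a file (SHA-1, as commonly used in dedup systems)."""
    return hashlib.sha1(data).hexdigest()
```

With `sample_rate=10` the in-memory index holds about a tenth of the hashes, matching the 1/10 figure reported in the abstract; the trade-off is that a directory none of whose files were sampled is missed entirely, which is exactly where the duplication locality of VM images helps.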
To address the index-capacity bottleneck in file-level duplicate detection and raise the deduplication ratio achievable under a sparse index, a deduplication scheme based on grouped directories is then proposed. The directory tree of the virtual machine image files is first partitioned into grouped directories of similar granularity; each group is sampled both randomly and with a method based on Broder's theory, and the sampled file indexes of the groups form the in-memory sparse index, on top of which the deduplication scheme is implemented. Experiments with the implementation verify the duplication locality of virtual machine image files. Comparing deduplication ratios under different sampling factors shows that, with the in-memory index reduced to 1/10 of its original size, the grouped-directory sparse index, by exploiting the inherent duplication locality of virtual machine image files, achieves a deduplication ratio above 96% while reducing the in-memory index count and avoiding the disk-access bottleneck. Further experiments compare the deduplication ratios of random sampling and Broder-based sampling within groups, and vary the group partition size to analyze the factors influencing the grouped-directory deduplication ratio. Finally, the experimental results compare deduplication under the grouped-directory sparse index with that under the random-sampling sparse index.

To overcome the limited scalability of a centralized deduplication system, a deduplication scheme for distributed environments is proposed. It parallelizes the deduplication process, distributes data storage, and uses a simple routing algorithm to keep the data nodes independent and autonomous. A simple, practical data-migration strategy is presented, and the scheme's characteristics, feasibility, and impact on overall system performance are analyzed. By avoiding the negative effects of communication between data nodes, the scheme achieves node autonomy and distributed, parallel deduplication.
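The abstract's "sampling method based on Broder's theory" most plausibly refers to min-wise hashing (Broder's min-hash); the sketch below illustrates that reading, and its function names and the choice of SHA-1 are assumptions for illustration. Each grouped directory is represented by its k smallest file fingerprints, so two groups sharing most of their files pick the same minima with high probability and therefore collide in the sparse index.

```python
import hashlib


def file_fingerprint(name: str) -> int:
    """Stable 64-bit integer fingerprint for a file (illustrative stand-in
    for the per-file content hash the thesis computes)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")


def broder_sample(group_files, k=2):
    """Min-hash sample of a grouped directory: keep the k smallest file
    fingerprints. In the spirit of Broder's min-wise hashing, groups that
    share most files select overlapping minima with high probability, so
    near-duplicate directories map to the same sparse-index entries."""
    return sorted(file_fingerprint(f) for f in set(group_files))[:k]
```

A useful property of this sampling is deterministic overlap under small edits: adding one file to a group can displace at most one of the k minima, so the samples of the original and modified group always share at least k-1 fingerprints.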
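For the distributed scheme, the abstract specifies only that a simple routing algorithm keeps the data nodes autonomous and that a simple migration strategy handles redistribution. The sketch below assumes one plausible concretization (hash-mod routing per grouped directory and a naive change-list migration plan); it is illustrative, not the thesis's actual policy.

```python
import hashlib


def node_for(group_key: str, num_nodes: int) -> int:
    """Simple hash-mod routing: each grouped directory is deduplicated and
    stored entirely on one node, so nodes need no cross-node index lookups."""
    digest = hashlib.sha1(group_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes


def migration_plan(group_keys, old_n, new_n):
    """Naive migration strategy (assumed for illustration): when the cluster
    grows from old_n to new_n nodes, list only the groups whose owning node
    changes, mapped to their new node."""
    return {k: node_for(k, new_n)
            for k in group_keys
            if node_for(k, old_n) != node_for(k, new_n)}
```

Because each group lives wholly on one node, deduplication within a group proceeds in parallel across nodes with no inter-node communication, matching the autonomy goal stated above; the cost is that duplicates spanning two groups routed to different nodes go undetected.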

【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master
【Year conferred】: 2012
【Classification number】: TP333








