面向云環(huán)境的重復(fù)數(shù)據(jù)刪除關(guān)鍵技術(shù)研究
Posted: 2018-08-25 15:01
【Abstract】With the advent of the big data era, the volume of data in the information world is growing explosively, and the storage and management demands of data centers have reached the petabyte and even exabyte scale. Studies show that increasingly complex massive data sets contain large amounts of duplicate data, both in backup and archival storage tiers and in conventional primary storage. Traditional data backup techniques and virtual machine image storage management methods further accelerate the growth of duplicates. To curb this rapid data growth, improve IT resource utilization, and reduce system energy consumption and management costs, data deduplication, an emerging data reduction technique, has become a research hotspot in both academia and industry.
Cloud computing, a key enabling technology for big data, optimizes resource utilization through network computing and virtualization to provide users with inexpensive, efficient, and reliable computing and storage services. In cloud backup and virtual desktop cloud environments, which hold large amounts of redundant data, deduplication can greatly reduce storage space requirements and improve network bandwidth utilization, but it also raises system performance challenges. This dissertation discusses how to use deduplication to optimize cloud backup services for personal computing environments, distributed cloud backup storage systems in data centers, and virtual desktop cloud cluster storage systems, so as to improve IT resource utilization and system scalability while limiting the impact of deduplication operations on I/O performance. Building on a comprehensive survey of current cloud computing technology, the dissertation analyzes deduplication-based cloud backup, big data backup, and virtual desktop cloud applications in depth, and proposes new system designs and algorithms. The main work and contributions are as follows:
(1) ALG-Dedupe, a tiered application-aware source deduplication mechanism for cloud backup services in personal computing environments, is proposed. Through statistical analysis of a large volume of personal application data, the dissertation finds for the first time that the amount of data shared across different types of application data sets is negligible. Using file semantics to guide the classification of application data, an application-aware index structure is designed that deduplicates each application class independently and in parallel, and adaptively selects the chunking strategy and fingerprinting function according to the characteristics of each class. Because client-side local redundancy detection and remote redundancy detection at the cloud data center are complementary in response latency and system overhead, the application-aware source deduplication is split into two tiers, local deduplication at the client and global deduplication in the cloud, to further improve the data reduction ratio and shorten deduplication time (a sketch of this flow follows). Experiments show that ALG-Dedupe greatly improves deduplication efficiency while effectively shrinking the backup window and cloud storage cost, and lowering the energy consumption and system overhead of personal computing devices.
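To make the two-tier, application-aware flow concrete, the sketch below pairs a per-class fingerprint index with a local lookup followed by a global one. It is a minimal illustration under stated assumptions: extension-based classification, 8 KB fixed-size chunks standing in for content-defined chunking, and SHA-1 fingerprints throughout; none of these are the thesis's exact parameters, and `classify`, `backup`, and `cloud_index` are hypothetical names.

```python
import hashlib
import os

# Illustrative assumption: already-compressed media has little intra-file
# redundancy, so whole-file chunking is cheaper than fine-grained chunking.
COMPRESSED_TYPES = {".mp3", ".jpg", ".zip", ".mp4"}

def classify(path):
    """Use file semantics (the extension) to pick an application class."""
    ext = os.path.splitext(path)[1].lower()
    return "compressed" if ext in COMPRESSED_TYPES else "regular"

def chunk(data, app_class):
    """Adaptive chunking: whole-file for compressed media, finer for the rest."""
    if app_class == "compressed":
        yield data
    else:
        for i in range(0, len(data), 8192):   # stand-in for content-defined chunking
            yield data[i:i + 8192]

local_index = set()                            # tier 1: fingerprints seen on this client

def backup(path, data, cloud_index):
    """Return the number of bytes actually uploaded after two-tier dedup."""
    app_class = classify(path)
    sent = 0
    for c in chunk(data, app_class):
        fp = (app_class, hashlib.sha1(c).hexdigest())  # per-class index key
        if fp in local_index:                  # local hit: no network round trip
            continue
        local_index.add(fp)
        if fp not in cloud_index:              # tier 2: global dedup in the cloud
            cloud_index.add(fp)
            sent += len(c)                     # only globally unique chunks travel
    return sent
```

Keying the index by application class is what lets each class be deduplicated independently and in parallel, since classes share almost no data.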
(2) E-Dedupe, a scalable cluster deduplication method supporting big data backup in cloud data centers, is designed. The novelty of the method lies in exploiting both data locality and data similarity to optimize cluster deduplication. E-Dedupe combines superblock-level data routing across cluster nodes with chunk-level deduplication within each node, improving the data reduction ratio while preserving the locality of data access. By extending Broder's theory of min-wise independent permutations, it is the first to apply handprinting to strengthen superblock resemblance detection. Weighting similarity by each node's storage utilization, a handprint-based stateful superblock routing algorithm assigns data at superblock granularity from backup clients to the deduplication server nodes (see the routing sketch below). A similarity index is built from the representative chunk fingerprints in each superblock's handprint and combined with a container management mechanism and a chunk fingerprint cache to optimize fingerprint lookup performance. With source-side inline deduplication, a backup client avoids transferring a superblock's duplicate chunks to the target routing node. Extensive experiments show that E-Dedupe achieves a high cluster-wide data reduction ratio while effectively reducing communication and memory overheads and keeping the load balanced across nodes.
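The following sketch illustrates the handprinting idea: by Broder's min-wise independent permutations result, the k smallest chunk fingerprints of a superblock form a representative sample, so the overlap of two handprints estimates the resemblance of the superblocks themselves. The utilization-weighted score and the `DedupeNode` structure are illustrative assumptions, not E-Dedupe's exact routing formula.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DedupeNode:
    name: str
    utilization: float                      # fraction of storage already used
    similarity_index: set = field(default_factory=set)  # representative fingerprints

def handprint(chunk_fps, k=8):
    """The k smallest chunk fingerprints of a superblock: a min-wise sample,
    so two handprints overlap roughly in proportion to superblock resemblance."""
    return set(sorted(chunk_fps)[:k])

def route_superblock(chunks, nodes, k=8):
    """Stateful routing: send the superblock to the node whose similarity index
    overlaps its handprint most, discounted by storage utilization so that
    nearly full nodes attract less new data (keeps the cluster load balanced)."""
    hp = handprint([hashlib.sha1(c).hexdigest() for c in chunks], k)
    target = max(nodes, key=lambda n: len(hp & n.similarity_index) * (1.0 - n.utilization))
    target.similarity_index |= hp           # index only the representative fingerprints
    return target
```

Indexing only the k representative fingerprints per superblock, rather than every chunk fingerprint, is what keeps the similarity index small enough to reside in memory.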
(3) A virtual desktop cloud storage optimization technique based on cluster deduplication is proposed. To support scalable virtual desktop cloud services, a virtual desktop server cluster must manage a large number of desktop virtual machines. By exploiting the semantic information of virtual machine image files, the dissertation is the first to propose a semantics-aware virtual machine scheduling algorithm for a deduplication-based virtual desktop cluster storage system. In addition, a deduplication-based virtual desktop storage I/O optimization strategy is designed that combines a server-side chunk cache with a local hybrid storage cache (a cache sketch follows). Experimental analysis shows that the technique effectively improves the space utilization of virtual desktop storage, reduces the number of storage I/O operations, and speeds up virtual desktop boot.
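The cache sketch below shows why deduplication helps here: when cached blocks are addressed by content fingerprint rather than by (VM, offset), desktop VMs cloned from similar images all hit one shared copy during a boot storm. The LRU policy and the `SharedBlockCache` API are hypothetical simplifications of the server-side chunk cache, not the thesis's implementation.

```python
from collections import OrderedDict

class SharedBlockCache:
    """Content-addressed LRU cache: VMs whose images contain identical blocks
    (same fingerprint) share one cached copy, cutting boot-storm I/O."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()                # fingerprint -> block data

    def read(self, fingerprint, fetch_from_store):
        if fingerprint in self.cache:
            self.cache.move_to_end(fingerprint)   # refresh LRU position
            return self.cache[fingerprint]
        block = fetch_from_store(fingerprint)     # miss: one fetch serves all VMs
        self.cache[fingerprint] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict least recently used
        return block
```

Pairing such a fingerprint-addressed cache with semantics-aware VM placement concentrates similar images on the same server, which raises the hit rate further.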
Together, these studies of key deduplication techniques for cloud environments provide solid technical support for future research on cloud storage and cloud computing.
【Degree-granting institution】: National University of Defense Technology
【Degree level】: PhD
【Year conferred】: 2013
【CLC number】: TP309.3; TP333
Document ID: 2203232