Research on Deduplication Optimization for Archival Storage
Published: 2018-06-05 10:01
Topics: deduplication + distributed storage; Source: Huazhong University of Science and Technology, 2013 master's thesis
【Abstract】: As society becomes more information-driven, data grows ever more valuable, and the storage demand of enterprise data centers is growing explosively. Current storage systems are designed mainly for data read/write performance and reliability, ignoring the correlation and redundancy among data. This not only wastes storage space but also makes it difficult for users to manage large volumes of structurally complex data effectively. Deduplication (De-duplication) technology has emerged in recent years to address this problem.

Based on an analysis of metadata access and query characteristics, as well as data layout and read/write characteristics in deduplication systems, this thesis presents a deduplication system architecture that separates metadata from data: (1) a three-party architecture consisting of clients, a metadata server, and storage nodes; (2) metadata access is confined to the path between client and metadata server, and file content access to the path between client and storage nodes, giving the scheme high scalability and high access concurrency. For the deduplication function, (1) data is partitioned into fixed-size chunks, and hash algorithms such as MD5 and SHA-1 compute each chunk's fingerprint; (2) a two-layer Bloom filter quickly screens and filters chunk fingerprints, and a B+-tree index serves as the persistent storage scheme for fingerprint metadata. To further optimize I/O performance, (1) a data layout policy that stores each data stream in its own region preserves spatial locality of data access; (2) client-side metadata and data caching mechanisms raise the cache hit rate of file accesses and improve file read/write performance.

Finally, a prototype deduplication system with the three-party architecture was designed and implemented, and functional and performance tests were run on it. The functional tests show that the proposed deduplication scheme achieves a 130% data compression rate on a virtual machine image test set. The performance tests show that the caching mechanism improves file access performance. Fingerprint-filtering statistics show that the two-layer Bloom filter has a high filtering rate, with an actual false-positive rate of 0.071%, within the range allowed by the theoretical false-positive rate of 0.1%.
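The fixed-size chunking and fingerprinting step described in the abstract can be sketched as follows. This is a minimal in-memory illustration, not the thesis's implementation: the `CHUNK_SIZE` value, the `dedup_store`/`restore` names, and the dict-backed chunk store are assumptions for the example; SHA-1 is one of the fingerprint hashes the thesis names.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed chunk size; the thesis does not specify a value

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks, fingerprint each chunk with SHA-1,
    and store only chunks whose fingerprint has not been seen before.
    Returns the file's list of fingerprints (its 'recipe')."""
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in store:      # duplicate detection by fingerprint lookup
            store[fp] = chunk    # persist only unique chunks
        recipe.append(fp)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its fingerprint recipe."""
    return b"".join(store[fp] for fp in recipe)
```

Storing the same file twice adds no new chunks, which is the source of the space savings the thesis measures as a compression rate.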
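The Bloom filter pre-check that screens fingerprints before the persistent B+-tree lookup can be illustrated with a single-filter sketch. The thesis's two-layer arrangement and its parameters are not described here, so this `BloomFilter` class, its sizes, and the MD5-derived bit positions are assumptions for the example. A Bloom filter can return false positives (hence the thesis's measured 0.071% false-positive rate) but never false negatives, so a "not present" answer lets the system skip the on-disk index entirely.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per item, derived from MD5."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive num_hashes independent positions by salting the hash input.
        for i in range(self.num_hashes):
            h = hashlib.md5(item + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly present" (check the index); False is definitive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

In a deduplication pipeline, a fingerprint that fails `might_contain` is certainly new and can be written immediately; only fingerprints that pass need the slower B+-tree confirmation.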
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year of award】: 2013
【Classification number】: TP333