Research and Application of Duplicate Data Detection Technology Based on the Hadoop Distributed System
Published: 2018-05-29 10:26
Topics: cloud computing + Hadoop; Source: Hunan University, master's thesis, 2013
[Abstract]: With the rapid development of information technology, cloud computing and data deduplication have both advanced quickly. Cloud computing holds a dominant position in massive data processing thanks to its powerful distributed computing capability, low cost, and high reliability. However, when data in a Hadoop system is archived, a large amount of duplicate data is present, which lowers the system's processing efficiency. Data deduplication is a popular storage technology that optimizes storage capacity and greatly reduces wasted physical storage space, helping to meet the growing demand for data storage. Combining cloud computing with data deduplication is therefore a win-win solution.

To address these problems, this thesis analyzes the characteristics of the Hadoop cloud computing platform and of data deduplication, and uses the Hadoop distributed platform to manage massive data. For the large amount of duplicate data in Hadoop systems, it proposes a duplicate detection technique based on data deduplication: the BLAKE fingerprint algorithm generates a fingerprint for each data block, deletion works at block-level granularity, and duplicates are removed in-line.

SHA-3 hash algorithms are recognized by industry for their advantages in data operations. This thesis adopts, for the first time, the SHA-3 candidate algorithm BLAKE as the fingerprint function in duplicate data detection, replacing the original fingerprint algorithm MD5 for both fingerprint generation and fingerprint matching. The algorithm is given a detailed software design and implementation of its own, and its experimental performance is much better than that of the traditional MD5 fingerprint algorithm.

Finally, the research is applied to the Internet of Vehicles (IoV), with Hadoop used to store and manage large-scale IoV data. Based on the characteristics of the HBase data model, a distributed storage model for traffic data is designed, including detailed designs of the main table and the reverse table, which supports users' conditional queries to a certain extent. Data deduplication is then used to detect duplicates in IoV data at archiving time. Experiments on three sets of vehicle terminal data, with a detailed performance analysis, show that the approach greatly reduces disk storage consumption, improves storage efficiency, and eliminates redundancy in stored data.
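The block-level, in-line deduplication described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes fixed-size 4 KiB blocks (the abstract gives no block size or chunking method), keeps the fingerprint index in memory instead of in Hadoop, and uses BLAKE2b from Python's standard library as a stand-in for BLAKE, the SHA-3 candidate the thesis uses (BLAKE2 is a later, related design, not the same function). The names `InlineDedupStore`, `fingerprint`, and `split_blocks` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

BLOCK_SIZE = 4096  # assumed fixed block size; the abstract does not state one


def fingerprint(block: bytes) -> str:
    """Return a hex fingerprint of one block.

    Assumption: BLAKE2b stands in for the SHA-3 candidate BLAKE.
    """
    return hashlib.blake2b(block, digest_size=32).hexdigest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> Iterable[bytes]:
    """Yield consecutive fixed-size blocks (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class InlineDedupStore:
    """Hypothetical in-memory store that deduplicates blocks before writing."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}      # fingerprint -> unique block
        self.files: Dict[str, List[str]] = {}   # file name -> ordered fingerprints

    def write(self, name: str, data: bytes) -> Tuple[int, int]:
        """Store data in-line; return (total blocks, newly stored blocks)."""
        recipe: List[str] = []
        new_blocks = 0
        for block in split_blocks(data):
            fp = fingerprint(block)
            if fp not in self.blocks:   # fingerprint match means duplicate
                self.blocks[fp] = block
                new_blocks += 1
            recipe.append(fp)
        self.files[name] = recipe
        return len(recipe), new_blocks

    def read(self, name: str) -> bytes:
        """Rebuild a file from its ordered list of fingerprints."""
        return b"".join(self.blocks[fp] for fp in self.files[name])


if __name__ == "__main__":
    store = InlineDedupStore()
    payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
    print(store.write("trip1.log", payload))  # (4, 2): two unique blocks stored
    print(store.write("trip2.log", payload))  # (4, 0): every block is a duplicate
    assert store.read("trip2.log") == payload
```

Because the fingerprint lookup happens before a block is stored, duplicates never reach storage, which is what distinguishes in-line deduplication from a post-process pass over already written data.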
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333
Article ID: 1950538
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1950538.html