Research and Optimization of File Storage in HDFS
Topic: cloud storage + Hadoop; Source: Guangdong University of Technology, master's thesis, 2013
【Abstract】: In recent years, cloud computing has been widely studied and applied, and has quickly become one of the hottest topics in computing. Cloud storage is a new concept that extends and develops the idea of cloud computing; among cloud storage systems, HDFS, the distributed file system of the Hadoop framework, is the best known. Studies have found that networks contain large amounts of duplicate data, and storing that data repeatedly wastes considerable space. In addition, small files are numerous and their read/write requests frequent; since every request is handled by the single NameNode of an HDFS cluster, overall system performance can degrade sharply.

The thesis first gives a comprehensive analysis of the Hadoop architecture and its implementation techniques, introduces data deduplication techniques, and examines the weaknesses of HDFS in handling large numbers of small files, laying the theoretical groundwork for the research that follows.

Building on the traditional HDFS architecture, the thesis then proposes a new HDFS architecture and designs its metadata management and file operation workflows, with dedicated strategies for the duplicate-data and small-file problems. The main contributions are:

(1) A new HDFS architecture derived from the traditional one, in which each rack gains an additional NameNode responsible for that rack's transactions. The metadata caching and recovery mechanisms of the primary NameNode and the in-rack NameNodes are analyzed, and the metadata retrieval process for file operations is redesigned.

(2) For duplicate data, a double-verification scheme: a keyword extraction strategy is designed, the extracted keywords are hashed, and text similarity matching then confirms the duplicate decision (a sketch of this check follows the abstract). The scheme avoids the drawbacks of fixed-size chunking deduplication, makes the duplicate decision more intelligent, and improves the accuracy and soundness of deduplication while saving storage space.

(3) For small files, a merging scheme: the metadata structure, cache contents, and update mechanism are analyzed, and the read, write, and delete workflows for small files are designed in detail (see the merge sketch below). Merging small files saves storage space, and because each in-rack NameNode handles most of its own rack's requests, the load on the primary NameNode is effectively relieved, further improving system performance.

Finally, simulation experiments based on the design show improvements, to varying degrees, in deduplication accuracy, small-file I/O speed, and NameNode memory and CPU usage, demonstrating the effectiveness of the design.
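The thesis text includes no source code, so the following Java sketch only illustrates the kind of double-verification check described in contribution (2), under assumed details: the keyword extraction (a simple top-k frequency stand-in; real Chinese text would additionally need word segmentation), the SHA-1 fingerprint, the cosine-similarity confirmation, and all names (DedupChecker, fingerprintIndex, isDuplicate) are illustrative, not the thesis's actual implementation.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Hypothetical sketch of a double-verification duplicate check: stage one
// hashes an extracted keyword fingerprint; stage two confirms a fingerprint
// hit with text similarity before declaring the file a duplicate.
public class DedupChecker {

    // Toy in-memory index: keyword-fingerprint hash -> stored document text.
    private final Map<String, String> fingerprintIndex = new HashMap<>();

    // Stand-in keyword extraction: the top-k most frequent terms, sorted for
    // a stable fingerprint. The thesis's actual extraction strategy differs.
    static List<String> extractKeywords(String text, int k) {
        return termFreq(text).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .sorted()
                .toList();
    }

    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty()) freq.merge(t, 1, Integer::sum);
        return freq;
    }

    // SHA-1 over the joined keyword list gives the first-stage fingerprint.
    static String hashKeywords(List<String> keywords) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(String.join("|", keywords).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Cosine similarity over term-frequency vectors, a common VSM measure.
    static double cosineSimilarity(String a, String b) {
        Map<String, Integer> fa = termFreq(a), fb = termFreq(b);
        Set<String> vocab = new HashSet<>(fa.keySet());
        vocab.addAll(fb.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String t : vocab) {
            int x = fa.getOrDefault(t, 0), y = fb.getOrDefault(t, 0);
            dot += (double) x * y; na += (double) x * x; nb += (double) y * y;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns true only when both the fingerprint and the similarity agree.
    public boolean isDuplicate(String text, double threshold) throws Exception {
        String fp = hashKeywords(extractKeywords(text, 10));
        String stored = fingerprintIndex.get(fp);
        if (stored != null && cosineSimilarity(text, stored) >= threshold)
            return true;
        fingerprintIndex.putIfAbsent(fp, text); // index the new file
        return false;
    }
}

A caller would treat isDuplicate(text, 0.9) == true as "store a reference instead of the bytes"; the 0.9 threshold is an arbitrary placeholder, and choosing it is exactly the accuracy trade-off the thesis's experiments evaluate.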
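Similarly, here is a minimal sketch of the merging idea in contribution (3): many small files are appended into one large container file and located through an (offset, length) index, so most reads avoid a per-file metadata lookup on the primary NameNode. The class and file names (SmallFileMerger, Extent, container.bin) are hypothetical, a local file stands in for an HDFS block, and the thesis's delete and metadata-update flows are omitted.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal sketch, under assumed semantics: small files are appended into one
// container file, and an in-memory index maps each logical file name to its
// (offset, length) extent so it can be read back without its own metadata entry.
public class SmallFileMerger {
    record Extent(long offset, int length) {}

    private final Path container;
    private final Map<String, Extent> index = new HashMap<>();

    public SmallFileMerger(Path container) { this.container = container; }

    // Appends one small file's bytes to the container and records its extent.
    public void merge(String name, byte[] data) throws IOException {
        long offset = Files.exists(container) ? Files.size(container) : 0;
        Files.write(container, data, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        index.put(name, new Extent(offset, data.length));
    }

    // Reads a merged small file back through its extent.
    public byte[] read(String name) throws IOException {
        Extent e = index.get(name);
        if (e == null) throw new FileNotFoundException(name);
        try (RandomAccessFile raf = new RandomAccessFile(container.toFile(), "r")) {
            byte[] buf = new byte[e.length()];
            raf.seek(e.offset());
            raf.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        SmallFileMerger m = new SmallFileMerger(Path.of("container.bin"));
        m.merge("a.txt", "hello".getBytes());
        m.merge("b.txt", "world".getBytes());
        System.out.println(new String(m.read("b.txt"))); // prints "world"
    }
}

In the thesis's design the analogous index would live in the in-rack NameNode's metadata cache, which is what lets that NameNode serve most small-file requests locally.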
【Degree-granting institution】: Guangdong University of Technology
【Degree level】: Master's
【Year of conferral】: 2013
【Classification number】: TP333