Design and Implementation of a Distributed Storage System for Massive Small Data Files
Published: 2018-04-10 14:18
Topic: massive small files + file system; source: Hunan University, master's thesis, 2013
[Abstract]: In recent years, the growth of the Internet has multiplied the scenarios in which massive amounts of information must be transmitted and stored, and data storage technology has developed rapidly in response. Because most Internet data takes the form of massive numbers of small files, distributed file systems, an important research direction for small-file storage, have become an active research topic. At present, storing massive numbers of small files in a distributed file system still commonly suffers from low storage performance, low storage space utilization, performance bottlenecks, and single points of failure. Solving these practical problems in the storage and transmission of massive small-file data is therefore important work in computer storage research.
First, to address these problems, this thesis proposes a data blocking scheme for storing massive numbers of small files on a single data node. The scheme describes the concept of a small file and the associated algorithm, and defines three metrics for a file block: intra-block utilization, intra-block correlation rate, and inter-block correlation rate. These metrics quantify how small files are distributed within each file block and measure the block's effect on data queries, so that optimization can be targeted.
Second, the thesis proposes an algorithm for determining the number of data replicas for small-file storage. The algorithm takes as its parameter the reliability of the data nodes that hold a small file's replicas. From this parameter the reliability of the small file can be determined quickly, and the system uses that reliability to decide whether the current number of replicas is sufficient. On this basis, a flexible weak-consistency maintenance scheme for small-file replicas is proposed.
Third, based on an analysis of the functional and performance requirements of a distributed storage system for massive small files, the thesis proposes a framework for the whole small-file storage and management system. The framework is designed and implemented in four main areas: the data node (DataNode); the data management server (DataServer); management of the file-block inverted table, the file inverted table, and directories; and the corresponding API functions.
Finally, the system was tested to evaluate its overall performance. Analysis and testing of key metrics lead to the conclusion that the system's performance basically meets the design requirements and can satisfy the demands of a real environment.
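The abstract does not state the formula behind the replica-count algorithm, so the following is only a minimal sketch of the general idea of deriving a file's reliability from the reliability of the nodes holding its replicas. It assumes that node failures are independent (so file reliability is 1 minus the product of the per-node failure probabilities) and that new replicas go to the most reliable candidate nodes first. The function names and the target threshold are hypothetical and do not come from the thesis.

```python
from typing import List, Sequence


def file_reliability(node_reliabilities: Sequence[float]) -> float:
    """Probability that at least one replica survives.

    Assumption (not from the thesis): nodes fail independently, so the
    file is lost only if every node holding a replica fails.
    """
    loss = 1.0
    for r in node_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("node reliability must be in [0, 1]")
        loss *= 1.0 - r
    return 1.0 - loss


def extra_replicas(current: Sequence[float],
                   candidates: Sequence[float],
                   target: float) -> List[float]:
    """Pick candidate nodes (by reliability) until the target is met.

    Greedy policy, assumed for illustration: try the most reliable
    candidates first. Returns the reliabilities of the chosen nodes.
    """
    nodes = list(current)
    chosen: List[float] = []
    for r in sorted(candidates, reverse=True):
        if file_reliability(nodes) >= target:
            break
        nodes.append(r)
        chosen.append(r)
    if file_reliability(nodes) < target:
        raise ValueError("target reliability cannot be met with these nodes")
    return chosen
```

Under these assumptions, a file whose only replica sits on a node with reliability 0.9 needs one more replica on a 0.95 node to reach a 0.99 target, while a file already replicated on three 0.9 nodes needs none.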
[Degree-granting institution]: Hunan University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP333
Article ID: 1731554
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1731554.html