Research and Implementation of a Distributed Storage System for Agricultural Scientific Data
Published: 2018-04-26 10:04
Topics: agricultural scientific data + distributed storage. Source: Beijing University of Technology, 2015 master's thesis
[Abstract]: Storing agricultural scientific data is an important part of agricultural research. Existing agricultural storage systems fall well short in performance, storage capacity, data reliability, and storage cost. To address the challenge of storing PB-scale, unstructured, and highly varied agricultural scientific data, this thesis analyzes agricultural scientific data files in depth, studies distributed storage technology, and proposes a distributed storage solution based on the open-source cloud platform Hadoop. The main results are as follows:

1) Based on the characteristics and application requirements of agricultural scientific data, the thesis designs a framework model for a distributed storage system for agricultural scientific big data. The model stores unstructured file data in an improved HDFS architecture and heterogeneous, structured attribute data in the HBase database system. It gives a design that preserves the association between data files and their attributes, and it sets up caches on the Client side and on the data nodes to improve file access efficiency.

2) Because agricultural scientific data contains massive numbers of small files, the thesis presents a multi-attribute merged-storage strategy for small agricultural science files. Small files are classified by specific attributes, and files in the same class are merged into one large aggregate file. This effectively reduces the memory that massive numbers of small files consume on the central node and improves file access efficiency. An index from small files to aggregate files is created and cached to improve read performance.

3) To address the hot-data problem caused by the strong seasonality of agricultural scientific data files, the thesis proposes a dynamic replica management strategy with two parts. The first is a method for dynamically adding and deleting replicas based on file access frequency: it counts accesses to a file over a fixed time window, computes the file's heat, and, taking into account factors such as the statistics period and file caching, dynamically adjusts the number of replicas. The second is a method for dynamic replica placement based on node state: it considers multiple parameters that describe data-node state, computes the performance of each node, and selects the best node for storage, in order to improve system performance and file read efficiency.

Based on these results, the thesis designs and implements AGRFS, a distributed storage system for agricultural scientific big data. AGRFS implements the basic functional modules and user access interfaces. A Hadoop cluster was built, and experiments verify the feasibility of the above strategies and the usability of the system. The results show that the proposed small-file storage strategy and dynamic replica management strategy improve the efficiency of small-file read and write operations and optimize system performance, and that the distributed storage system designed in the thesis handles agricultural scientific data storage well.
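
The small-file merging idea in the abstract (classify small files by an attribute, merge each class into one aggregate file, and keep an index from each small file to its place in the aggregate) can be illustrated with a minimal in-memory sketch. This is not the AGRFS implementation: the function names, the single classification attribute, and the (attribute, offset, length) index layout are all assumptions made for illustration, and plain Python bytes stand in for HDFS files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (file name, classification attribute, raw bytes).
SmallFile = Tuple[str, str, bytes]
# Hypothetical index entry: (aggregate key, byte offset, byte length).
IndexEntry = Tuple[str, int, int]


def merge_small_files(
    files: List[SmallFile],
) -> Tuple[Dict[str, bytes], Dict[str, IndexEntry]]:
    """Group small files by attribute and concatenate each group.

    Returns (aggregates, index): aggregates maps attribute -> merged bytes,
    and index maps file name -> (attribute, offset, length).
    """
    buffers: Dict[str, bytearray] = defaultdict(bytearray)
    index: Dict[str, IndexEntry] = {}
    for name, attribute, data in files:
        buf = buffers[attribute]
        # Record where this file starts before appending its bytes.
        index[name] = (attribute, len(buf), len(data))
        buf.extend(data)
    aggregates = {attr: bytes(buf) for attr, buf in buffers.items()}
    return aggregates, index


def read_small_file(
    name: str,
    aggregates: Dict[str, bytes],
    index: Dict[str, IndexEntry],
) -> bytes:
    """Look up a small file in the index and slice it out of its aggregate."""
    attribute, offset, length = index[name]
    return aggregates[attribute][offset:offset + length]


if __name__ == "__main__":
    # Made-up sample files, for illustration only.
    sample = [
        ("wheat_2014.csv", "crop", b"yield,1"),
        ("soil_a.txt", "soil", b"ph=6.5"),
        ("rice_2014.csv", "crop", b"yield,2"),
    ]
    aggregates, index = merge_small_files(sample)
    print(read_small_file("rice_2014.csv", aggregates, index))
```

In this sketch the central node would track one entry per aggregate rather than one per small file, which is the memory saving the abstract refers to. The index is the piece a real system would cache to keep reads fast.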
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP333
Article ID: 1805558
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1805558.html