Optimization of the MapReduce Data Access Method for HBase
Published: 2018-11-01 16:28
【Abstract】: With the rapid development of information technology, the volume of data on the Internet is growing quickly and its variety keeps expanding; the world has shifted to a data-centric paradigm, the era of "big data". Traditional data processing relies mainly on database management systems, whose storage is hard to scale and whose queries are inefficient in the face of big data, so they increasingly fail to meet the demand for efficient data processing. More and more enterprises are therefore turning to the open-source Hadoop cloud platform, using HBase to store and manage their data. Reading data from HBase can be parallelized with the MapReduce framework, giving a large speedup over traditional database management; even so, under this framework HBase data reading still cannot keep up with data processing, mainly because HBase's MapReduce data access method cannot fully guarantee data locality.

This thesis first introduces background on big data, covering big data storage and processing technologies, surveys the classification, characteristics, and major platforms of cloud computing, and examines three key technologies of the most widely used Hadoop cloud platform: HDFS, MapReduce, and HBase. This provides the theoretical basis for analyzing and improving HBase's MapReduce process. Then, through an in-depth analysis of the MapReduce task-allocation flow, the data-splitting process, and the workflow of HBase's data-reading interface (Scan), three bottlenecks of MapReduce computation over HBase are identified: 1) tasks cannot always be assigned locally; 2) data within a Region is read serially; 3) the data must be merged once more to assemble each record. To address these problems, an improved method is proposed that takes HBase's physical storage unit, the Block, rather than the logical storage unit, the Region, as the basic unit of task allocation; the split-reading method is redesigned, and the locality-first MapReduce scheduling strategy proposed by Hua Zhongjie is adopted.

Finally, comparative experiments show that the improved interface removes the extra processing done by the Scan interface and strengthens data locality, cutting data access time to one tenth of that of the original interface and thereby effectively improving efficiency.
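For context, the baseline the thesis analyzes is the standard Scan-driven MapReduce job over HBase, in which TableInputFormat produces one input split (and hence one serial scanner) per Region. Below is a minimal sketch of that baseline using the classic HBase mapreduce API; the table name "usertable" and the row-counting mapper are illustrative assumptions, not the thesis's own workload:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanBaselineJob {
  // The mapper receives one fully merged row (Result) at a time. Because
  // TableInputFormat creates one map task per Region, reads inside each
  // Region are serial and each record is assembled by a merge step --
  // the access pattern the thesis identifies as the bottleneck.
  static class RowCountMapper
      extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("baseline", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-scan-baseline");
    job.setJarByClass(ScanBaselineJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC round trip
    scan.setCacheBlocks(false);  // recommended for full-table MR scans

    // One split (and thus one mapper) per Region of the assumed table.
    TableMapReduceUtil.initTableMapperJob(
        "usertable", scan, RowCountMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```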
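The proposed improvement instead treats HBase's physical storage units, the HDFS blocks of each HFile, as the split unit, so that every map task can be scheduled on a DataNode that physically holds its block. The thesis's implementation is not reproduced here; the following is a hypothetical sketch, using only the standard HDFS client API, of how block-granular, locality-tagged splits could be enumerated. The table path /hbase/usertable and the flat recursive walk over region and column-family directories are assumptions about the on-disk layout:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class BlockSplitLister {
  public static void main(String[] args) throws Exception {
    // Hypothetical location of the target table's HFiles; the real layout
    // depends on the HBase version and configured root directory.
    Path tableDir = new Path(args.length > 0 ? args[0] : "/hbase/usertable");
    FileSystem fs = tableDir.getFileSystem(new Configuration());

    // Recursively walk down to the HFiles and emit one would-be input
    // split per HDFS block, tagged with the DataNodes storing that block,
    // so the scheduler can place the map task where the data lives.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(tableDir, true);
    while (files.hasNext()) {
      LocatedFileStatus hfile = files.next();
      if (!hfile.isFile()) {
        continue;
      }
      for (BlockLocation block : hfile.getBlockLocations()) {
        System.out.printf("split: %s offset=%d length=%d hosts=%s%n",
            hfile.getPath(), block.getOffset(), block.getLength(),
            String.join(",", block.getHosts()));
      }
    }
  }
}
```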
【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year of degree conferral】: 2012
【CLC number】: TP311.13; TP333
Record ID: 2304468
Link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2304468.html