Research and Application of Distributed ETL Based on Spark

Published: 2018-05-02 21:12

Topic: big data + distributed ETL; source: Donghua University, 2017 master's thesis

[Abstract]: In the big data era, ever more data must be processed and put to use. For enterprises, data has become a foundation for survival, and how well a company exploits its own data is critical to its future development. Data warehouse technology offers an effective way for enterprises to analyze massive data, and in building a data warehouse, ETL is often the most time-consuming and complex stage. Growing data volumes place higher performance demands on ETL technology and pose greater challenges. To meet the ETL processing needs of massive data, it is necessary to carry out ETL with distributed parallel techniques. Current distributed ETL schemes built on the MapReduce paradigm can process massive data efficiently. However, the MapReduce programming model offers only two processing primitives, Map and Reduce, and multi-step processing incurs high I/O overhead, so these schemes have performance problems in the ETL transformation phase and leave considerable room for optimization in processing efficiency and speed. Addressing the massive volume of big data and the limitations of MapReduce-based distributed ETL, this thesis combines data warehouse theory with distributed processing technology to study distributed parallel ETL on Spark. It proposes a design for distributed ETL, focuses on the parallel implementation of transformations in the data transformation phase, and gives a suitable solution for each type of transformation. For early-stage non-aggregate operations, such as basic data cleaning and data format standardization, it proposes a partition-based parallel pipeline processing algorithm that processes data partition by partition, thereby improving transformation efficiency.
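The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of partition-at-a-time pipelined processing, written in plain Python rather than Spark. The record fields (`region`, `amount`), the two stages, and all function names are hypothetical and chosen for illustration; in Spark the closest built-in is `RDD.mapPartitions`, which likewise hands a whole partition iterator to one function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def drop_incomplete(rows: Iterator[Record]) -> Iterator[Record]:
    """Cleaning stage (hypothetical): discard rows whose amount is missing."""
    return (r for r in rows if r.get("amount") not in (None, ""))


def standardize(rows: Iterator[Record]) -> Iterator[Record]:
    """Format stage (hypothetical): normalize region text, parse amount as float."""
    for r in rows:
        yield {"region": r["region"].strip().upper(), "amount": float(r["amount"])}


def run_pipeline(partition: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Chain the stages lazily so each record passes through every stage
    in a single pass over the partition, with no intermediate lists."""
    rows: Iterator[Record] = iter(partition)
    for stage in stages:
        rows = stage(rows)
    return list(rows)


def transform(partitions: List[List[Record]], stages: List[Stage]) -> List[List[Record]]:
    """Process partitions independently; threads stand in for cluster executors."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: run_pipeline(p, stages), partitions))


partitions = [
    [{"region": " east ", "amount": "10.5"}, {"region": "west", "amount": ""}],
    [{"region": "West", "amount": "3"}, {"region": "east", "amount": None}],
]
cleaned = transform(partitions, [drop_incomplete, standardize])
print(cleaned)
```

Because the stages are generators, the cleaning and standardization steps are fused into one traversal per partition, which is the property a partition-based pipeline relies on to avoid materializing intermediate results between steps.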
For aggregate operations, such as aggregating numerical data in fact tables, a partition pre-aggregation method is used to reduce the frequency of data transfer. Experimental results show that the proposed methods markedly accelerate the transformation of large data volumes and thus improve the performance and processing efficiency of distributed ETL. The thesis then studies performance optimization of the Spark-based data processing flow. It analyzes in detail the data skew problems common in Spark processing and gives a corresponding parallel tuning strategy for the skew arising in each scenario. Related experiments demonstrate the effectiveness of the tuning strategies. Finally, drawing on the development of a real decision support system, the thesis describes the design and application of Spark-based distributed ETL, including a comparative analysis against traditional ETL development approaches; the results show that the proposed Spark-based distributed ETL scheme is effective and highly scalable.
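The partition pre-aggregation mentioned at the start of this paragraph can be illustrated with a small plain-Python sketch. It is an assumption-laden illustration, not the thesis's implementation: the `(key, amount)` record shape and the function names are hypothetical. Each partition first sums its own records per key, and only those partial sums are merged across partitions. In Spark, `reduceByKey` performs a comparable map-side combine before the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # hypothetical (dimension key, measure) pair


def pre_aggregate(partition: Iterable[Row]) -> Dict[str, float]:
    """Sum the measure per key inside a single partition (local combine)."""
    partial: Dict[str, float] = defaultdict(float)
    for key, amount in partition:
        partial[key] += amount
    return dict(partial)


def merge_partials(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge the per-partition partial sums into final totals."""
    totals: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


partitions: List[List[Row]] = [
    [("EAST", 10.0), ("EAST", 5.0), ("WEST", 1.0)],
    [("WEST", 2.0), ("EAST", 4.0), ("WEST", 3.0)],
]

partials = [pre_aggregate(p) for p in partitions]
raw_records = sum(len(p) for p in partitions)      # records without pre-aggregation
shuffled_records = sum(len(p) for p in partials)   # records that still need to move
totals = merge_partials(partials)
print(totals, raw_records, shuffled_records)
```

In this toy input, six raw records shrink to four partial sums before the merge step; with many records per key in each partition, the reduction in transferred data is correspondingly larger.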
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1835522.html
