Research and Application of Distributed ETL Based on Spark

Published: 2018-05-02 21:12

  Topics: big data + distributed ETL; Source: Donghua University, master's thesis, 2017


[Abstract]: In the era of big data, ever more data needs to be processed and used. For an enterprise, data has become the basis of survival, and whether it can make good use of its own data is critical to its future development. Data warehouse technology offers an effective way to analyze massive data, and ETL is often the most time-consuming and complex stage in building a data warehouse. The growing volume of data to be processed places higher performance demands on ETL technology and brings greater challenges. To meet the ETL processing needs of massive data, it is necessary to perform ETL with distributed parallel technology. Current distributed ETL schemes built on the MapReduce paradigm can process massive data efficiently. However, the MapReduce programming model offers only two kinds of processing, Map and Reduce, and multi-step processing incurs high I/O overhead. As a result, these schemes have performance problems in the ETL transformation phase, and much room remains for optimizing processing efficiency and speed. To address the "massive volume" characteristic of big data and the limitations of MapReduce-based distributed ETL schemes, this thesis combines data warehouse theory with distributed processing technology to study distributed parallel ETL based on Spark. It proposes a design scheme for distributed ETL, focuses on the parallel implementation of transformation processing during data transformation, and gives suitable solutions for different types of transformation. For early-stage non-aggregate operations, such as basic data cleaning and data format standardization, it proposes a partition-based parallel pipeline processing algorithm that processes data partition by partition to improve the efficiency of data transformation.
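The abstract names a partition-based parallel pipeline algorithm but does not give its details. As a rough illustration only, the sketch below chains non-aggregate transformations over each partition in plain Python, in the spirit of Spark's `mapPartitions`. The record fields (`date`, `amount`), the two stage functions, and the sample data are all assumptions made for the example; none of them come from the thesis.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical cleaning rule: drop records with a missing 'amount'."""
    for record in records:
        if record.get("amount") is not None:
            yield record


def standardize(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical format rule: rewrite 'date' from YYYY/MM/DD to YYYY-MM-DD."""
    for record in records:
        yield {**record, "date": str(record["date"]).replace("/", "-")}


def run_pipeline(partition: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Chain the stages lazily so each record flows through every stage
    without materializing an intermediate data set between stages."""
    stream: Iterable[Record] = iter(partition)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


def transform(partitions: List[List[Record]], stages: List[Stage]) -> List[List[Record]]:
    """Apply the whole pipeline to each partition independently.
    A cluster would run these calls in parallel; here they run one by one."""
    return [run_pipeline(partition, stages) for partition in partitions]


# Assumed sample data: two partitions of sales records.
partitions = [
    [{"date": "2017/01/02", "amount": 10}, {"date": "2017/01/03", "amount": None}],
    [{"date": "2017/02/01", "amount": 7}],
]
result = transform(partitions, [clean, standardize])
```

Because the stages are generators, no partition needs data from any other partition, which is what makes non-aggregate transformations straightforward to parallelize.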
For aggregate operations, such as aggregating the numerical data of fact tables, a partition pre-aggregation method is used to reduce the frequency of data transfer. Experimental results show that the proposed methods markedly speed up the transformation of large data volumes and thereby improve the performance and processing efficiency of distributed ETL. The thesis then studies performance optimization of the Spark-based data processing flow. It analyzes in detail the data skew problems common in Spark processing and gives a corresponding parallel tuning strategy for the skew found in each scenario. Related experiments demonstrate the effectiveness of the tuning strategies. Finally, based on the development of a real decision support system, the thesis describes the design and application of Spark-based distributed ETL, including a comparative analysis with traditional ETL development schemes. The analysis shows that the proposed Spark-based distributed ETL scheme is effective and highly scalable.
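The partition pre-aggregation idea mentioned in the abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the keys (`shop_a`, `shop_b`), the amounts, and the function names are hypothetical. Each partition first sums its own values per key, and only those partial sums are merged across partitions, which is similar in spirit to the map-side combining that Spark's `reduceByKey` performs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]


def pre_aggregate(partition: Iterable[Row]) -> Dict[str, int]:
    """Sum amounts per key inside a single partition (local step)."""
    subtotals: Dict[str, int] = defaultdict(int)
    for key, amount in partition:
        subtotals[key] += amount
    return dict(subtotals)


def merge(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Combine the per-partition subtotals into final totals (global step)."""
    totals: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


# Assumed sample data: (key, amount) rows of a fact table in two partitions.
partitions: List[List[Row]] = [
    [("shop_a", 5), ("shop_a", 3), ("shop_b", 2)],
    [("shop_a", 1), ("shop_b", 4), ("shop_b", 6)],
]

partials = [pre_aggregate(partition) for partition in partitions]
totals = merge(partials)

# Records that would cross partition boundaries with and without pre-aggregation.
rows_without_preaggregation = sum(len(partition) for partition in partitions)
rows_with_preaggregation = sum(len(partial) for partial in partials)
```

In this toy input, six raw rows shrink to four partial sums before the merge. The saving grows with the number of repeated keys per partition.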
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13

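[Citing literature]

Related journal articles (top 1)

1 Ding Xiangwu; Xie Shuliang; Li Jiyun; Parallel ETL based on Spark [J]; Computer Engineering and Design; 2017, No. 09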



Article ID: 1835522

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1835522.html

