Design and Implementation of a Distributed ETL System for Industrial Big Data

Published: 2018-08-26 10:59

[Abstract]: Since the start of the Industry 4.0 era, the rapid development of the Internet and computer technology, together with their deep integration with industrial systems, has driven profound changes in productivity, relations of production, production technology, business models, and innovation models, moving the entire industrial system toward a revolutionary shift to full intelligence. Industrial big data analytics is a key area in which future industry will gain competitive advantage in the global market. With the arrival of the Internet of Things and cyber-physical systems, more data can be collected, analyzed, and used to make better-informed decisions. The foundation of the whole analysis process is how historical data is gathered from the various data sources into the analysis system and how real-time data is loaded into it from the sensors. This calls for the data processing tool ETL (Extract-Transform-Load). Traditional ETL tools mostly run in parallel on a single machine, and their processing speed and processing volume fall far short of what industrial data analysis requires. Commercial ETL tools perform well, but they are expensive and place such high demands on hardware that they cannot be widely adopted. To address this situation, this thesis designs and implements a low-cost, high-performance distributed ETL system for industrial data processing.
The design of the distributed ETL system is organized into three modules: data extraction, data transformation, and data loading. The extraction stage provides three schemes: change data capture based on split-table triggers, differential data synchronization based on data verification, and real-time data extraction based on the Redis Pub/Sub communication pattern. The transformation stage provides a batch layer and an acceleration layer, chosen according to the data's requirements for processing speed and processing volume. The batch layer handles historical data with low real-time requirements and is implemented with Hadoop MapReduce; the acceleration layer handles real-time data and is implemented with Spark Streaming stream processing. In the loading stage, Sqoop loads structured data and the HDFS client loads unstructured data. Finally, the distributed ETL system is subjected to functional tests and performance tests. The experimental results show that the system performs well when processing industrial big data, which is of practical significance for the informatization of industrial data.
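As a rough illustration of the differential data synchronization idea named in the abstract, the sketch below compares per-row digests between a source snapshot and a target snapshot and reports which keys must be inserted, updated, or deleted. The abstract does not describe the thesis's actual verification algorithm, so the SHA-256 row digests, the function names `row_digest`, `diff_rows`, and `synchronize`, and the in-memory dictionaries standing in for database tables are all assumptions made for illustration.

```python
import hashlib


def row_digest(row):
    """Return a stable SHA-256 hex digest of a row (a dict of column -> value)."""
    # Sort the columns so the digest does not depend on dict ordering.
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_rows(source, target):
    """Compare two snapshots keyed by primary key.

    Returns (to_insert, to_update, to_delete) as sorted lists of keys.
    """
    src = {key: row_digest(row) for key, row in source.items()}
    dst = {key: row_digest(row) for key, row in target.items()}
    to_insert = sorted(k for k in src if k not in dst)
    to_delete = sorted(k for k in dst if k not in src)
    to_update = sorted(k for k in src if k in dst and src[k] != dst[k])
    return to_insert, to_update, to_delete


def synchronize(source, target):
    """Change target in place so it matches source; return the change counts."""
    to_insert, to_update, to_delete = diff_rows(source, target)
    for key in to_insert + to_update:
        target[key] = dict(source[key])  # copy so the snapshots stay independent
    for key in to_delete:
        del target[key]
    return len(to_insert), len(to_update), len(to_delete)


if __name__ == "__main__":
    # Hypothetical sensor readings, used only to exercise the functions.
    source = {
        1: {"sensor": "s1", "value": 20.5},
        2: {"sensor": "s2", "value": 7.0},
        3: {"sensor": "s3", "value": 1.2},
    }
    target = {
        1: {"sensor": "s1", "value": 20.5},
        2: {"sensor": "s2", "value": 6.5},
        4: {"sensor": "s4", "value": 9.9},
    }
    print(diff_rows(source, target))
    print(synchronize(source, target))
```

Comparing digests rather than full rows means that, in a distributed setting, only the key and a fixed-size digest per row would need to cross the network before deciding which rows to transfer. Whether the thesis takes this approach is not stated in the abstract.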
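[Degree-granting institution]: University of Chinese Academy of Sciences (Shenyang Institute of Computing Technology, Chinese Academy of Sciences)
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13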

Article ID: 2204665

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2204665.html
