Design and Implementation of a Distributed ETL System for Industrial Big Data
[Abstract]: With the rapid development of the Internet and computer technology and their deep integration with industrial systems, the Industry 4.0 era has brought profound changes in productivity, relations of production, and production technology. The transformation of business models and modes of innovation is driving the entire industrial system toward full intelligence. Industrial big data analysis is a key area in which future industries will compete for advantage in the global market. With the advent of the Internet of Things and cyber-physical systems, more data can be collected, analyzed, and used to make better-informed decisions. Throughout the big data analysis process, two questions form the foundation of all subsequent analysis: how historical data is gathered from the various data sources into the analysis system, and how real-time data is loaded from each sensor into the analysis system. Both tasks rely on the data processing tool ETL (Extract-Transform-Load). Traditional ETL tools mostly run in parallel on a single computer system, and their processing speed and capacity fall far short of the requirements of industrial data analysis. Commercial ETL tools perform well, but they are expensive and place high demands on hardware, which prevents wide adoption. To address this situation, this thesis designs and implements a low-cost, high-performance distributed ETL system for industrial data processing. The system is divided into three modules: data extraction, data transformation, and data loading. For the data extraction stage, three schemes are designed: a change data capture scheme based on table triggers, a differential data synchronization scheme based on data verification, and a real-time data extraction scheme based on Redis Pub/Sub communication.
In the data transformation stage, a batch layer and an acceleration (speed) layer are designed to meet the requirements on processing capacity and processing speed, respectively. The batch layer mainly processes historical data with low real-time requirements and is implemented with Hadoop MapReduce. The acceleration layer processes real-time data using Spark Streaming. In the data loading stage, structured data is loaded mainly with Sqoop, and unstructured data is loaded through the HDFS client. Finally, the functionality and performance of the distributed ETL system are tested. The experimental results show that the ETL system designed in this thesis achieves better performance when handling industrial big data, which is of practical significance for the informatization of industrial data.
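The map, shuffle, and reduce phases that a batch layer of this kind relies on can be sketched in plain Python. This is an in-process illustration only, not Hadoop code and not the thesis's implementation; the record shape `(sensor_id, reading)`, the cleaning rule (dropping missing readings), and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (sensor_id, reading) pair per record that has a reading."""
    for sensor_id, reading in records:
        if reading is not None:  # assumed cleaning rule: skip missing values
            yield sensor_id, float(reading)


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single aggregate (here, the mean)."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def run_batch(records):
    """Run the three phases in sequence over an in-memory batch of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))
```

For example, `run_batch([("s1", 10), ("s1", 20), ("s2", None), ("s2", 5)])` averages the two `s1` readings and ignores the missing `s2` reading. On a real cluster the map and reduce functions would run on separate nodes and the grouping step would be handled by the framework.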
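[Degree-granting institution]: University of Chinese Academy of Sciences (Shenyang Institute of Computing Technology, Chinese Academy of Sciences)
[Degree level]: Master's
[Year of degree]: 2017
[Classification number]: TP311.13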
Article ID: 2204665
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2204665.html