Research on the Design and Optimization Techniques of Distributed Data Warehouses in Enterprise Environments

Published: 2018-05-06 20:15

Topics: distributed systems + data warehouse; Source: Beijing University of Posts and Telecommunications, 2016 master's thesis

[Abstract]: Since the turn of the century, driven by Internet and Internet of Things technologies, the volume of data available to enterprises has kept growing. Enterprises no longer need data only for daily transaction processing; many have begun building large data warehouses to store and analyze the massive data they face. A data warehouse collects user data from different sources and with different structures, then classifies and integrates it by subject, so that analysis of data on the same subject is more targeted and reliable and more useful to managers making decisions. Traditional centralized data warehouses, limited in scalability and performance, can no longer withstand the load of processing massive data. The rise of Hadoop has shown the computing power of distributed technology, and distributed architectures will become the direction in which data warehouse systems develop.

In response, this thesis presents analysis and design in three areas: the distributed architecture of the data warehouse, unified metadata management, and the combination of data warehouse technology with the Hadoop open-source framework. Combining the Hadoop open-source framework, the MySQL database, distributed storage technology, and Impala parallel query technology, it designs a complete system architecture. Integration of source data, that is, the ETL (Extract-Transform-Load) work, is carried out as MapReduce jobs. For metadata management, the thesis studies the metadata management mechanism of data warehouse systems and the metadata implementation of the Impala query engine, and designs and implements a centralized metadata management module based on MySQL.

The system first extracts and transforms the source data through MapReduce jobs, partitions the intermediate result data across nodes according to a user-specified partitioning scheme, and then imports it in parallel. The MySQL database stores and manages the system's metadata in the form of a lib. The storage layer uses an efficient single-node storage engine so that each storage node can store and scan data efficiently. Queries are executed by the Impala parallel query engine; query and storage share one metadata scheme, which gives unified management of metadata. With this system, enterprise users can manage massive data efficiently and also perform multidimensional analysis on it, providing data support for formulating and adjusting enterprise strategy. Finally, experiments test the import and query performance of the distributed system, and analysis of the results shows that the system is effective for processing enterprise data.
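The abstract does not say which partitioning schemes the system supports, so the following is only a minimal sketch of the partitioning step that precedes the parallel import, assuming hash partitioning on a user-chosen key column. The function names, the `customer_id` field, and the node count are all hypothetical and do not come from the thesis.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition-key value to a node index with a stable hash."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def split_records(records, key_field: str, num_nodes: int):
    """Group transformed records into one bucket per storage node (hypothetical helper)."""
    buckets = defaultdict(list)
    for record in records:
        node = partition_for(str(record[key_field]), num_nodes)
        buckets[node].append(record)
    return dict(buckets)


if __name__ == "__main__":
    # Illustrative intermediate rows, standing in for the output of the transform step.
    rows = [{"customer_id": i % 5, "amount": i * 10} for i in range(20)]
    for node, bucket in sorted(split_records(rows, "customer_id", 3).items()):
        print(f"node {node}: {len(bucket)} records")
```

Each bucket could then be loaded into its storage node independently, which is what makes the import step parallel. A range-based or round-robin scheme would only change `partition_for`.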
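[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP311.13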

Thesis record: Gao Xinlei (高鑫磊); Research on the Design and Optimization Techniques of Distributed Data Warehouses in Enterprise Environments [D]; Beijing University of Posts and Telecommunications; 2016

Article ID: 1853721

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1853721.html
