Compatible Study of Hadoop for Efficient Analyzing and Proce
Published: 2021-01-02 04:02
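As computers are used, data is continuously generated and accumulated. The resulting question is where to store this data. In the past, storage costs made this problem expensive to solve; with recent advances in technology, however, storage costs have fallen. Big data is a collection of data sets so large and wide-ranging that they are difficult to process with traditional database management tools. Processing large data sets with traditional methods is also very time-consuming, so the Hadoop framework, which is faster and more efficient than traditional methods, is widely used. The main goal is to process continuously generated data more efficiently and in less time, without having to store the data. Data falls into three main categories: structured, unstructured, and semi-structured. Hadoop provides different types of frameworks for processing these huge data sets. We focus on three of them, Pig, Hive, and Impala, and systematically study how to analyze structured data sets effectively and reduce the time spent processing them. We compare the three Hadoop frameworks experimentally by applying them to two different data sets and examining their data-processing efficiency. Specifically, we run similar tasks on Hive, Pig, and Impala and evaluate the experimental results. The results show that Impala is more efficient than Hive and Pig, because it needs less time to execute the tasks.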
[Source]: Southwest University of Science and Technology, Sichuan Province
[Pages]: 59
[Degree level]: Master's
[Table of contents]:
Abstract (in Chinese)
Abstract
Chapter 1 Introduction
1.1 Introduction
1.2 Big Data Definitions
1.3 Research Background
1.3.1 Big Data Applications
1.3.2 Challenges of Big Data
1.3.3 Apache Hadoop
1.3.4 Hadoop Environment
1.3.5 Hadoop Architecture and Design
1.3.6 Hadoop Distributed File System (HDFS)
1.3.7 MapReduce
1.3.8 Hadoop Ecosystem
1.4 Objective of Research
1.5 Contributions and Significance of Research
Chapter 2 Related Work/Review of Literature
2.1 Introduction
2.2 Review of Literature
Chapter 3 Methodology
3.1 Completely Unstructured Data
3.2 Semi-Structured Data
3.3 Structured Data
3.4 Estimation Technique
3.5 Apache PIG-based Calculating
3.6 Apache HIVE-based Data Storage
3.7 Apache IMPALA-based Data Management
Chapter 4 Experiment and Results
4.1 Dataset
4.2 System Requirements
4.3 Apache Pig
4.3.1 Contents of our Input File
4.3.2 Copying the Input File
4.3.3 Executing the Pig commands on File
4.3.4 Mapper and Reducer Running Job
4.3.5 Output
4.4 Apache Hive
4.4.1 Create Table and Loading the Data
4.4.2 Query Execution
4.4.3 Mapper and Reducer Running Job
4.5 Apache Impala
4.5.1 Contents of Input File
4.5.2 Create Table and Loading the Data
4.5.3 Query Execution
4.5.4 Output
4.6 Comparison of Results (Pig, Hive, Impala)
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Reference
ACKNOWLEDGEMENTS
Academic Achievements
DEDICATION
Article ID: 2952612
Article link: http://sikaile.net/kejilunwen/shengwushengchang/2952612.html