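Analysis of Airline Ticket Settlement Data Based on a Multidimensional Data Model
Keywords: ticket settlement data; data warehouse; iceberg cube; bitmap index; distributed computing; data mining. Source: Civil Aviation University of China, 2017 master's thesis. Document type: degree thesis.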
[Abstract]: With the rapid development of China's civil aviation industry, more and more passengers choose to travel by air. Airline passenger volume is growing quickly, and the ticket settlement data it generates is growing explosively. Years of accumulation have left ticket settlement data both high-dimensional and large in volume, and analyzing it places heavy performance demands on traditional BI (Business Intelligence) systems. Building a multidimensional ticket settlement data cube and using distributed computing to speed up query and analysis is therefore of considerable value. To address the running time of iceberg cube aggregation over multidimensional data in BI systems, this thesis proposes DPBUC_BI (Dynamic Pruning based BUC_BI), an algorithm improved with a bitmap index. The algorithm exploits the column-wise organization of bitmap indexes to redefine the grouping operation of the BUC (Bottom-Up Computation) algorithm, which speeds up data loading and querying, and it implements aggregation with logical bit operations, which improves computational performance. Because ticket settlement data is clustered on some dimensions, a dynamic pruning strategy is added that further improves iceberg cube computation while preserving the correctness of the algorithm. Finally, DPBUC_BI is applied to iceberg cube computation on ticket settlement data; the experimental results show that it runs considerably faster than the classic BUC algorithm.
To better store massive ticket settlement data and analyze it along multiple dimensions, the thesis reimplements the traditional ticket settlement analysis platform on a distributed computing framework. After migrating the data with Flume and Sqoop, it builds a data warehouse with a fact constellation schema and compares the characteristics of the ORC and Parquet storage formats. To reduce the large space footprint of bitmap indexes, it compresses them with the EWAH (Enhanced Word Aligned Hybrid) algorithm, and it implements a multidimensional aggregation algorithm and a multidimensional association rule mining algorithm on the MapReduce model. The experimental results show that the distributed ticket settlement analysis platform not only completes simple statistical analyses quickly but also runs the parallel association rule mining algorithm well.
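The idea of replacing BUC's sort-and-partition grouping with bitmap intersections can be illustrated with a minimal sketch. This is not the thesis's DPBUC_BI implementation: the function names, the sample rows, and the use of Python integers as uncompressed bitmaps are all assumptions made for illustration, and only the standard iceberg (minimum-support) pruning is shown, not the dynamic pruning strategy, whose details the abstract does not give.

```python
from collections import defaultdict


def build_bitmap_index(rows, num_dims):
    """Hypothetical helper: for each dimension, map each value to an int
    whose bit i is set when row i carries that value (one bitmap per value)."""
    index = [defaultdict(int) for _ in range(num_dims)]
    for row_id, row in enumerate(rows):
        for dim in range(num_dims):
            index[dim][row[dim]] |= 1 << row_id
    return index


def popcount(bitmap):
    """Number of set bits, i.e. the COUNT aggregate of a cell."""
    return bin(bitmap).count("1")


def iceberg_cube(rows, num_dims, min_support):
    """BUC-style recursive expansion where grouping is a bitwise AND.

    Returns {cell: count} for every cell whose count >= min_support.
    A cell is a tuple with None standing for the aggregated value ALL.
    """
    index = build_bitmap_index(rows, num_dims)
    result = {}

    def expand(cell, bitmap, start_dim):
        result[cell] = popcount(bitmap)
        for dim in range(start_dim, num_dims):
            for value, value_bitmap in index[dim].items():
                child = bitmap & value_bitmap  # grouping via logical AND
                # Iceberg pruning: a cell below the threshold cannot have
                # a descendant above it, so the whole branch is skipped.
                if popcount(child) >= min_support:
                    child_cell = cell[:dim] + (value,) + cell[dim + 1:]
                    expand(child_cell, child, dim + 1)

    if len(rows) >= min_support:
        all_rows = (1 << len(rows)) - 1
        expand((None,) * num_dims, all_rows, 0)
    return result


# Made-up sample rows: (carrier, airport, cabin class).
rows = [
    ("CA", "PEK", "Y"),
    ("CA", "PEK", "Y"),
    ("CA", "SHA", "F"),
    ("MU", "SHA", "Y"),
]
cube = iceberg_cube(rows, num_dims=3, min_support=2)
for cell, count in sorted(cube.items(), key=str):
    print(cell, count)
```

In this sketch each cell's row set is a single integer, so narrowing a group by one more dimension costs one AND and one bit count rather than a re-sort of the partition. A production version would presumably operate on compressed bitmaps such as EWAH instead of raw integers.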
[Degree-granting institution]: Civil Aviation University of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: F560.5; TP311.13
Article ID: 1490368
Article link: http://sikaile.net/shoufeilunwen/xixikjs/1490368.html