Research on Clustering Mining of Historical Logistics Data Based on Hadoop
Topic: Hadoop. Focus: Canopy-Kmeans. Source: Xi'an Technological University, master's thesis, 2017
[Abstract]: With the development and adoption of e-commerce, the Internet of Things, cloud computing, and related technologies, data growth in the logistics industry is no longer linear or slow; it is massive, complex, real-time, and explosive. Traditional single-machine storage and serial data mining can no longer meet the industry's big data processing needs. Hadoop, an open-source distributed platform suited to distributed computation over large data sets, has become a prevailing trend, and in recent years it has shown distinct advantages in data mining. K-means is an effective clustering algorithm for big data mining that is simple to implement and easy to use. However, its choice of initial centroids and of K is largely blind and unpredictable, which often traps the clustering in a local optimum. Its distance calculations also involve substantial redundant computation, so it converges slowly, clusters with low accuracy, and lacks parallelism and scalability, all of which greatly reduce its running efficiency.
To address these shortcomings, this thesis combines the strengths of the distance triangle inequality principle and the min-max principle, and proposes an improved Canopy-Kmeans algorithm based on a dual-MapReduce distributed programming model on the Hadoop cloud computing platform. The correctness of the algorithm is verified with real historical data from Shefa Logistics Company. The specific work is as follows.
First, the thesis describes the Hadoop ecosystem in detail and analyzes its basic components, building blocks, and working mechanisms; it reviews the standard workflow of big data mining; and it studies the design and procedure of the traditional K-means algorithm, discussing the strengths and weaknesses of existing work.
Second, to improve the selection of K, the traditional Canopy algorithm is modified on the Hadoop platform according to the min-max principle. This removes the blindness of manually setting K and the region radii T1 and T2 in the traditional Canopy algorithm, and gives the accuracy of the K-means clustering results a reliable theoretical basis.
Third, to eliminate the large amount of redundant computation in the iterations of traditional K-means, a distance-filtering test based on the triangle inequality is added before the iterative calculation, which effectively removes much of that redundancy. In addition, to further improve running efficiency, a weighted clustering criterion function is introduced and a convergence test is added on top of it, which improves clustering quality and convergence speed and lowers the misclassification rate of data objects.
Finally, the improved Canopy-Kmeans algorithm based on the dual-MapReduce programming model is designed and implemented. To verify the feasibility of the design, a Hadoop cluster was set up and extensive experiments were run, taking the identification of Shefa Logistics Company's key customer groups as the example task. The experimental results show that the parallel algorithm improves markedly in clustering accuracy, speedup, and scalability. It resolves the problems in selecting K and the Canopy centers, avoids redundant distance calculations during iteration, and converges faster than the original algorithm; the larger the data set and the more nodes in the cluster, the more pronounced the improvement.
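The two ideas the abstract combines can be illustrated on a single machine. The Python sketch below is not the thesis's implementation, which runs as MapReduce jobs on Hadoop and is not reproduced in the abstract. It rests on two assumptions: that the min-max principle can be read as farthest-first selection (each new center maximizes its minimum distance to the centers already chosen), and that the distance filter is the standard triangle-inequality test, under which a center c_j cannot be closer to a point p than the current best center c_b whenever d(c_b, c_j) >= 2 * d(p, c_b). All function names are hypothetical.

```python
import math
import random


def farthest_first_centers(points, k):
    """Pick k centers; each new one maximizes its minimum distance
    to the centers chosen so far (assumed reading of 'min-max')."""
    centers = [points[0]]  # assumption: seed with the first point
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return centers


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality rules out. Returns (labels, distances_computed)."""
    k = len(centers)
    # Center-to-center distances are computed once and reused for every point.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, computed = [], 0
    for p in points:
        best, best_d = 0, math.dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best),
            # so c_j cannot win and its distance need not be computed.
            if cc[best][j] >= 2 * best_d:
                continue
            d = math.dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def assign_brute_force(points, centers):
    """Reference assignment that computes every point-to-center distance."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


# Synthetic data: three well-separated groups of 2-D points.
random.seed(0)
points = [(cx + random.uniform(-0.5, 0.5), cy + random.uniform(-0.5, 0.5))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(20)]
centers = farthest_first_centers(points, 3)
labels, computed = assign_with_pruning(points, centers)
print(computed, "of", len(points) * len(centers), "distances computed")
```

The pruned assignment returns the same labels as the brute-force version while computing fewer point-to-center distances. The saving depends on how well separated the centers are, which is why a spread-out initialization and the distance filter complement each other.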
[Degree-granting institution]: Xi'an Technological University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13