Research on Clustering Mining of Logistics Historical Data Based on Hadoop

Published: 2018-04-08 13:14

Topic: Hadoop　Focus: Canopy-Kmeans　Source: Xi'an Technological University, master's thesis, 2017

[Abstract]: With the development and adoption of e-commerce, the Internet of Things, cloud computing, and related technologies, data growth in the logistics industry is no longer slow and linear; the data are massive, complex, real-time, and growing explosively. Traditional single-machine storage and serial data mining can no longer meet the industry's big-data processing needs. Hadoop, an open-source distributed platform suited to distributed computation over large data sets, has become a prevailing trend, and in recent years it has shown distinct advantages in data mining.

K-means is an effective clustering algorithm for mining large data sets, and it is simple to implement and use. However, the choice of the initial centroids and of K is largely blind and unpredictable, which often traps the clustering result in a local optimum. The distance calculations also involve considerable redundant computation, convergence is slow, clustering accuracy is low, and the algorithm lacks parallelism and scalability, all of which greatly reduce its running efficiency. To address these shortcomings, this thesis combines the strengths of the triangle inequality for distances and the "min-max principle" and proposes, on the Hadoop cloud computing platform, an improved Canopy-Kmeans algorithm based on a double-MapReduce distributed programming model. The correctness of the algorithm is verified on real historical data from Shefa Logistics Company. The specific work is as follows.

First, the thesis describes the Hadoop ecosystem in detail and analyzes its basic components, building blocks, and working mechanism; it reviews the standard process of big-data mining; and it studies the design and procedure of the traditional K-means algorithm, discussing the strengths and weaknesses of existing research.

Second, to improve the selection of K, the traditional Canopy algorithm is modified on the Hadoop platform according to the min-max principle. This removes the blindness of manually setting K and the region radii T1 and T2 in the traditional Canopy algorithm, and it provides a reliable theoretical basis for the accuracy of the K-means clustering results.

Third, to eliminate the large amount of redundant computation in the iterations of traditional K-means, a distance-filtering test based on the triangle inequality is added before the iterative calculation, which effectively reduces redundant distance computations. To further improve efficiency, the thesis introduces a weighted clustering criterion function and, on that basis, adds a convergence test, which improves clustering quality and convergence speed and lowers the misclassification rate of data objects.

Finally, the improved Canopy-Kmeans algorithm based on the double-MapReduce programming model is designed and implemented. To verify the feasibility of the design, a Hadoop cluster was set up and extensive experiments were run, using the task of identifying the key customer groups of Shefa Logistics Company as the example. The experimental results show that the parallel algorithm improves markedly in clustering accuracy, speedup, and scalability. It resolves the problems in selecting K and the Canopy centers, avoids redundant distance calculations during iteration, and converges faster than the original algorithm; the larger the data set and the more nodes in the cluster, the more pronounced the improvement.
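The abstract does not give the thesis's exact rules, so the following is only a minimal single-machine sketch of the two ideas it names, under stated assumptions. It assumes the "min-max principle" means farthest-first seeding (each new center is the point whose distance to its nearest existing center is largest), and it uses the well-known center-to-center bound as the triangle-inequality filter: if d(c_best, c_j) >= 2 * d(p, c_best), then c_j cannot be closer to p than c_best, so d(p, c_j) need not be computed. The names `max_min_seeds` and `assign` are hypothetical, and no MapReduce, weighted criterion function, or automatic choice of K is shown.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_seeds(points, k):
    """Farthest-first seeding (assumed reading of the min-max principle).

    Starts from points[0]; each further seed is the point that maximizes
    the distance to its nearest already-chosen seed.
    """
    seeds = [points[0]]
    nearest = [dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        seeds.append(points[idx])
        nearest = [min(d, dist(p, points[idx])) for p, d in zip(points, nearest)]
    return seeds


def assign(points, centers, use_filter=True):
    """Assign each point to its nearest center.

    Returns (labels, number of point-to-center distances computed).
    With use_filter=True, a candidate center c_j is skipped whenever
    d(c_best, c_j) >= 2 * d(p, c_best), which by the triangle inequality
    implies d(p, c_j) >= d(p, c_best).
    """
    k = len(centers)
    # Pairwise center distances, computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            if use_filter and cc[best][j] >= 2 * best_d:
                continue  # c_j cannot beat the current best center
            d = dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


# Toy data (hypothetical): three tight groups of five points each.
bases = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
points = [(bx + dx, by + dy) for bx, by in bases for dx, dy in offsets]

centers = max_min_seeds(points, 3)
filtered_labels, filtered_count = assign(points, centers, use_filter=True)
plain_labels, plain_count = assign(points, centers, use_filter=False)
print("same labels:", filtered_labels == plain_labels)
print("distances computed:", filtered_count, "with filter,", plain_count, "without")
```

On this toy data both passes produce the same labels, while the filtered pass computes fewer point-to-center distances. In a MapReduce setting, the center-to-center table would presumably be computed once per iteration and shared with every mapper, but that arrangement is an assumption here rather than something the abstract states.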
[Degree-granting institution]: Xi'an Technological University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

Article ID: 1721779

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1721779.html
