Research and Implementation of Hadoop-Based Big Data Analysis and Mining Technology for the Telecommunications Industry

Published: 2019-04-12 06:20

[Abstract]: As information technology advances, the volume of data being generated is growing rapidly, and data mining techniques have developed alongside it. Data at this scale brings both challenges and opportunities, and extracting useful information from it is a difficult task. The telecommunications industry holds large amounts of customer data; analyzing and mining that data with big data technologies to uncover latent knowledge and improve the service experience is a worthwhile goal. Against this background, the thesis does the following work. First, the algorithms are studied and improved: a clustering algorithm is used for customer segmentation and a decision tree algorithm for customer prediction. The traditional K-means algorithm requires the number of clusters as input, but with data this large the distribution is not known in advance, which makes the algorithm hard to apply. To address this shortcoming, the thesis improves K-means and implements a DGK-means algorithm, which uses a genetic algorithm to find the most suitable number of clusters and a density-based approach to compute the genetic algorithm's fitness function, improving both efficiency and accuracy. A decision tree model is built with the C4.5 algorithm and used to predict outcomes for records whose results are unknown, in support of customer prediction and customer retention. Second, the Hadoop platform is used for big data analysis and mining. A Hadoop-based big data analysis and mining system for the telecommunications industry is designed and implemented, with HDFS providing distributed storage and the MapReduce programming model used to parallelize the algorithms. At the algorithm layer, each algorithm is given its own parallel design to improve efficiency.
Finally, test data sets are used to verify the performance of the system and the algorithms. The results show that the DGK-means algorithm is more accurate and more efficient than the traditional algorithm. Parallel computation improves efficiency when the cluster has more than 2 nodes, and the improvement becomes more pronounced as the number of nodes increases.
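
The abstract does not show how K-means is parallelized with MapReduce, so the following is only a minimal in-memory sketch of the usual pattern, not the thesis's implementation. It assumes a map step that assigns each point to its nearest centroid and a reduce step that averages the points of each cluster. The function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, plain Python stands in for Hadoop, and the genetic-algorithm selection of the cluster count in DGK-means is not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point: Point, centroids: Sequence[Point]) -> Tuple[int, Tuple[Point, int]]:
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values: Iterable[Tuple[Point, int]]) -> Point:
    """Reduce step: sum the (partial sum, count) pairs of one cluster and average."""
    total: List[float] = []
    count = 0
    for partial_sum, n in values:
        if not total:
            total = list(partial_sum)
        else:
            total = [t + p for t, p in zip(total, partial_sum)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points: Iterable[Point], centroids: Sequence[Point]) -> List[Point]:
    """One K-means iteration written as map -> group by key -> reduce."""
    groups: Dict[int, List[Tuple[Point, int]]] = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # A centroid that attracted no points keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(data, [(1.0, 1.0), (9.0, 9.0)]))
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS blocks, and a driver would repeat the iteration until the centroids stop moving. Because the values are (partial sum, count) pairs, the same reduce logic could in principle also be used for a combiner's partial sums.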
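
C4.5, the decision tree algorithm named in the abstract, chooses split attributes by gain ratio (information gain divided by split information). The sketch below illustrates that criterion for categorical attributes only. The customer records, attribute names, and function names are hypothetical and are not taken from the thesis, and the handling of continuous attributes and pruning in full C4.5 is omitted.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(rows: Sequence[Dict[str, str]], labels: Sequence[str],
               attribute: str) -> float:
    """Gain ratio of splitting the records on one categorical attribute."""
    total = len(labels)
    partitions: Dict[str, List[str]] = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attribute]].append(label)

    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder

    # Split information penalizes attributes with many small partitions.
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0


def best_attribute(rows: Sequence[Dict[str, str]], labels: Sequence[str],
                   attributes: Sequence[str]) -> str:
    """Pick the attribute with the highest gain ratio."""
    return max(attributes, key=lambda a: gain_ratio(rows, labels, a))


if __name__ == "__main__":
    # Hypothetical churn records, for illustration only.
    customers = [
        {"plan": "prepaid", "complaints": "yes"},
        {"plan": "postpaid", "complaints": "yes"},
        {"plan": "prepaid", "complaints": "no"},
        {"plan": "postpaid", "complaints": "no"},
    ]
    churned = ["yes", "yes", "no", "no"]
    print(best_attribute(customers, churned, ["plan", "complaints"]))
```

A tree builder would apply `best_attribute` recursively to each partition until the labels in a partition are pure or no attributes remain.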
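[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13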



Article ID: 2456765


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2456765.html
