Research and Implementation of Big Data Analysis and Mining Technology for the Communication Industry Based on Hadoop
[Abstract]: With the development of information technology, data volumes are expanding rapidly, and data mining technology has developed in response. Extracting useful information from data at this scale is both a challenge and an opportunity. The communication industry holds a large amount of customer data, and analyzing and mining it with big data technologies to uncover latent knowledge can help improve the service experience. Against this background, the work in this paper is as follows.
First, the algorithms are studied and improved: a clustering algorithm is used for customer segmentation, and a decision tree algorithm is used for customer prediction. The traditional K-means algorithm requires the number of clusters as input, but for data at this scale the distribution is not known in advance, which makes the algorithm hard to apply. To address this shortcoming, this paper improves the K-means clustering algorithm and implements the resulting DGK-means algorithm. A genetic algorithm is used to find the most suitable number of clusters, and its fitness function is computed using a density-based idea, which improves the efficiency and accuracy of the algorithm. The C4.5 algorithm is used to build a decision tree model, which is then applied to records whose outcome is unknown, in support of customer prediction and customer retention.
Second, the Hadoop platform is used to analyze and mine big data, and a Hadoop-based big data analysis and mining system is designed and implemented. HDFS provides distributed storage of the data, and the MapReduce programming model is used to run the algorithms in parallel. At the algorithm layer, the algorithms are given a parallel design to improve efficiency.
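The abstract does not give the details of the parallel design, so the following is only a minimal sketch of the commonly used way to express one K-means iteration as a map step and a reduce step. It runs in a single Python process rather than on Hadoop, and the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) and the sample points are hypothetical, not taken from the thesis.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster_id, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum the points of each cluster and divide by the count."""
    sums = {}
    counts = defaultdict(int)
    for cid, (point, n) in pairs:
        if cid not in sums:
            sums[cid] = list(point)
        else:
            sums[cid] = [s + p for s, p in zip(sums[cid], point)]
        counts[cid] += n
    # A centroid that received no points is kept unchanged.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop changing."""
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical two-dimensional sample data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real MapReduce job the map step would run on separate input splits and the framework would group the emitted pairs by cluster id before the reduce step; here a single generator stands in for that shuffle. The number of clusters is still fixed by the initial centroids, which is the limitation the DGK-means algorithm is said to address.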
Finally, a test data set is used to verify the performance of the system and the algorithms. The results show that the DGK-means algorithm is more accurate and more efficient than the traditional algorithm. Parallel computing improves efficiency once the cluster has more than two nodes, and the improvement becomes more pronounced as the number of nodes grows.
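For the C4.5 decision tree mentioned in the abstract, the split criterion is the gain ratio: information gain divided by the split information of the attribute. The sketch below computes it for categorical attributes only. The customer attributes (`plan`, `roaming`) and the `leave`/`stay` labels are hypothetical illustrations and do not come from the thesis data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting rows on the categorical attribute attr."""
    total = len(rows)
    # Group the class labels by the value each row has for attr.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = -sum(
        (len(g) / total) * log2(len(g) / total) for g in groups.values()
    )
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical customer records; the labels say whether the customer left.
rows = [
    {"plan": "prepaid", "roaming": "yes"},
    {"plan": "prepaid", "roaming": "no"},
    {"plan": "postpaid", "roaming": "yes"},
    {"plan": "postpaid", "roaming": "no"},
]
labels = ["leave", "leave", "stay", "stay"]

for attr in ("plan", "roaming"):
    print(attr, gain_ratio(rows, labels, attr))
```

In this toy data `plan` separates the two classes perfectly while `roaming` carries no information, so a tree builder using this criterion would split on `plan` first. A full C4.5 implementation also handles continuous attributes, missing values, and pruning, none of which are shown here.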
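[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13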