Research on Clustering Mining Algorithms for Big Data
Published: 2018-05-06 23:37
Topic: big data + clustering mining. Source: Nanjing University of Posts and Telecommunications, master's thesis, 2015
[Abstract]: The enormous potential value of big data has driven the development of big data mining, the process of extracting valuable knowledge from data sources characterized by large scale, high velocity, and variety. How to mine valuable knowledge from big data accurately and quickly is a current research focus. This thesis studies clustering mining algorithms for big data, with the goal of improving both their accuracy and their efficiency: it first improves a traditional clustering algorithm to raise accuracy, then parallelizes the improved algorithm to raise efficiency. To improve clustering accuracy, the thesis builds on the DBSCAN and k-means algorithms and proposes a density-based incremental k-means clustering algorithm (Density-based Incremental k-means, DBIK-means). DBIK-means first computes the density of each data point and forms basic clusters, each consisting of a center point whose density is not less than a given threshold together with the points within its density range. It then merges basic clusters according to the distance between their center points. Finally, it assigns every point that belongs to no cluster to the cluster nearest to it. Theoretical analysis and experiments on the KDD CUP 99 data set show that the algorithm can find clusters of arbitrary shape, is insensitive to the input order of the data points and to its parameters, and achieves higher clustering accuracy with only a slight increase in running time; its overall performance is better than that of k-means. To improve the efficiency of DBIK-means and reduce its time complexity, the thesis uses a distributed database to simulate shared storage and parallelizes the algorithm on the Hadoop cloud computing platform. Simulation experiments show that DBIK-means is suitable for clustering large-scale data sets. Finally, the thesis applies DBIK-means to the classification of telecom customers. The results show that the algorithm can automatically divide a large number of telecom customers into several clusters with reasonable accuracy, which helps telecom operators design different marketing strategies for different types of customers.
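The steps the abstract lists (compute densities, form basic clusters around dense center points, merge basic clusters whose centers are close, assign leftover points to the nearest cluster) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function name and the parameters `eps`, `min_density`, and `merge_dist` are hypothetical, density is assumed to be a neighbor count within radius `eps`, centers are assumed to be visited densest first, and "nearest cluster" is assumed to mean the nearest basic-cluster center point. Any incremental k-means refinement the thesis applies is not shown, because the abstract does not describe it.

```python
import math


def dbik_means_sketch(points, eps, min_density, merge_dist):
    """Illustrative sketch only; all parameter names are assumptions."""
    n = len(points)

    # Step 1: density of a point = number of points within radius eps
    # (the point itself included). This definition is an assumption.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    density = [len(nb) for nb in neighbors]

    # Step 2: form basic clusters around unlabeled points whose density
    # reaches the threshold, visiting the densest points first (assumption).
    labels = [-1] * n
    centers = []  # point index of each basic cluster's center
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if labels[i] != -1 or density[i] < min_density:
            continue
        cluster_id = len(centers)
        centers.append(i)
        for j in neighbors[i]:
            if labels[j] == -1:
                labels[j] = cluster_id

    # Step 3: merge basic clusters whose center points lie within
    # merge_dist of each other, tracked with a small union-find.
    parent = list(range(len(centers)))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            gap = math.dist(points[centers[a]], points[centers[b]])
            if gap <= merge_dist:
                parent[find(b)] = find(a)

    # Renumber merged clusters as 0, 1, 2, ...
    roots = sorted({find(c) for c in range(len(centers))})
    remap = {root: k for k, root in enumerate(roots)}
    labels = [remap[find(l)] if l != -1 else -1 for l in labels]

    # Step 4: assign each leftover point to the cluster of the nearest
    # basic-cluster center point (assumption about "nearest cluster").
    if centers:
        for i in range(n):
            if labels[i] == -1:
                nearest = min(
                    range(len(centers)),
                    key=lambda c: math.dist(points[i], points[centers[c]]),
                )
                labels[i] = remap[find(nearest)]
    return labels


# Two small blobs plus one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (3, 3)]
print(dbik_means_sketch(sample, eps=1.5, min_density=3, merge_dist=3.0))
# [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

In this toy run the isolated point (3, 3) has too low a density to seed a cluster, so the last step attaches it to the closer blob. The pairwise distance computation is quadratic in the number of points, which is the kind of cost that motivates the parallelization described above.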
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP311.13