Research on the Improvement of the k-means Clustering Algorithm and Its Applications
Topic: improved k-means algorithm + BWP index value; Source: Lanzhou Jiaotong University, master's thesis, 2017
【Abstract】: Data mining is the process of extracting deep, valuable information from large volumes of disordered data. Data mining applications involve a variety of techniques, chiefly clustering, classification, association analysis, and predictive control. Among these, cluster analysis is an important branch of data mining: the process of partitioning the objects of a data set into disjoint subsets. Cluster analysis is now widely applied in many fields, such as Web search, artificial intelligence, information retrieval, image pattern recognition, spatial database technology, and marketing. The clustering methods in common use fall into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and probabilistic-model-based methods [1]. k-means is a commonly used partitioning algorithm; it is simple in principle, easy to understand and implement, and able to handle large data sets. Given a training data set and the number of clusters, the algorithm iteratively clusters the data according to a criterion function until the function no longer changes or an agreed threshold is reached. Its main drawbacks are that the number of clusters must be given in advance, that the result is sensitive to the chosen initial centers and to noise points in the data set, and that the result may be only a local optimum. This thesis addresses three of these drawbacks, namely that the number of clusters must be specified beforehand, that the choice of initial centers strongly affects the result, and that the result is sensitive to outliers, and proposes an improved k-means clustering algorithm based on the max-min distance method. When applying the max-min distance method, the algorithm first uses a divide-and-conquer strategy to split the theoretical interval of the parameter θ into smaller subintervals, picks one value of θ in each subinterval, clusters the data set under each θ, and discards the subintervals that yield poor clusterings. The remaining subintervals are then discretized in the spirit of continuous-attribute discretization, θ ranges over the endpoints of the discretized subintervals, and the data set is clustered for each value. Each clustering is scored by the mean of the 95% ordered BWP index values: the larger the mean, the better the clustering, and the largest mean marks the best result. The improved algorithm thus removes the need to specify the number of clusters in advance and reduces the sensitivity to initial centers and outliers. To verify its effectiveness, three data sets from the UCI repository are clustered with several different algorithms; the results show that the improved algorithm achieves higher accuracy and better clusterings. Finally, a data set of telecom users in Hangzhou, Zhejiang Province is studied. On the one hand, the traditional k-means algorithm, the max-min-distance-based k-means algorithm, and the improved algorithm are each applied to it; the improved algorithm clusters better, with more pronounced differences between clusters. The characteristics of each resulting customer segment are then summarized, segment names are defined, and differentiated marketing plans are drawn up to improve service quality in the industry. On the other hand, following the standard logistic modeling procedure, a logistic classification model is trained on historical data to predict the churn rate of each segment, so that the operator can take retention measures for likely churners in advance.
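The max-min distance method for choosing initial centers, on which the improved algorithm builds, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the specific first-center rule and the stopping test θ·d(c₁, c₂) are one common formulation of the method, and the function name is hypothetical.

```python
import numpy as np

def maxmin_centers(X, theta):
    """Select initial cluster centers by the max-min distance rule.

    theta in (0, 1) controls how far a point must lie from every chosen
    center (relative to the distance between the first two centers) to
    become a new center; the number of clusters k falls out automatically.
    """
    # First center: the point farthest from the data mean (one convention).
    d0 = np.linalg.norm(X - X.mean(axis=0), axis=1)
    centers = [X[np.argmax(d0)]]
    # Second center: the point farthest from the first.
    d1 = np.linalg.norm(X - centers[0], axis=1)
    centers.append(X[np.argmax(d1)])
    base = np.linalg.norm(centers[1] - centers[0])
    while True:
        # For every point, the distance to its nearest chosen center.
        dists = np.min(
            [np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        i = np.argmax(dists)
        if dists[i] > theta * base:   # far from all centers -> new center
            centers.append(X[i])
        else:
            break                     # no point is far enough; stop
    return np.array(centers)
```

On three well-separated synthetic clusters, a mid-range θ such as 0.5 recovers exactly three centers; the thesis's contribution is precisely the search over candidate θ values, since the recovered k depends on θ.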
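The BWP-based scoring used to compare candidate clusterings can be sketched as follows. This is a hedged rendering of the BWP (Between-Within Proportion) index from 周世兵's work as described here, not the thesis's code; reading "the mean of the 95% ordered BWP values" as dropping the lowest 5% of sorted values is an assumption, and both function names are hypothetical.

```python
import numpy as np

def bwp_scores(X, labels):
    """Per-sample BWP values for a labeled data set.

    For sample j in cluster i:
      b = min over other clusters of the mean distance to that cluster,
      w = mean distance to the other members of its own cluster,
      BWP = (b - w) / (b + w), in (-1, 1]; larger is better.
    """
    ids = np.unique(labels)
    scores = []
    for j, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        own = labels == labels[j]
        # Mean within-cluster distance, excluding the point itself.
        w = d[own].sum() / max(own.sum() - 1, 1)
        # Smallest mean distance to any other cluster.
        b = min(d[labels == c].mean() for c in ids if c != labels[j])
        scores.append((b - w) / (b + w))
    return np.array(scores)

def trimmed_bwp_mean(scores, keep=0.95):
    """Mean of the ordered BWP values after discarding the lowest 5%,
    one plausible reading of the thesis's '95% ordered BWP mean'."""
    s = np.sort(scores)
    return s[int(len(s) * (1 - keep)):].mean()
```

Trimming the lowest tail is consistent with the abstract's goal of reducing the influence of outliers on the evaluation: a correct labeling of well-separated data scores close to 1, while an arbitrary labeling scores near 0.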
【Degree-granting institution】: Lanzhou Jiaotong University
【Degree level】: Master
【Year conferred】: 2017
【CLC number】: TP311.13
【References】
Related journal articles (10):
1. 田琿. Applied research on a value-assessment model for group customers in the mobile industry [J]. 現(xiàn)代工業(yè)經(jīng)濟(jì)和信息化, 2016(23).
2. 蹤鋒, 程林. Application of the K-means algorithm to customer segmentation in logistics and express-delivery enterprises [J]. 中國(guó)市場(chǎng), 2016(36).
3. 魏瑾. Research on clustering-based marketing strategies for telecom customer segments [J]. 中國(guó)市場(chǎng), 2016(31).
4. 梁霄波. Research on clustering-based data mining techniques for telecom customer segmentation [J]. 現(xiàn)代電子技術(shù), 2016(15).
5. 左倪娜. A K-means clustering method based on an improved genetic algorithm [J]. 軟件導(dǎo)刊, 2016(4).
6. 方匡南, 范新妍, 馬雙鴿. Enterprise credit-risk early warning with a network-structured logistic model [J]. 統(tǒng)計(jì)研究, 2016(4).
7. 楊曉斌, 毛雪岷. Application of cluster analysis to telecom customer segmentation [J]. 鄂州大學(xué)學(xué)報(bào), 2015(7).
8. 何坤金. Discussion and application of divide-and-conquer algorithms [J]. 福建電腦, 2015(4).
9. 方方, 王子英. Application of K-means cluster analysis to human body-type classification [J]. 東華大學(xué)學(xué)報(bào)(自然科學(xué)版), 2014(5).
10. 曹樹(shù)國(guó). Research on an improved divide-and-conquer shuffle algorithm for examination-room seating [J]. 計(jì)算機(jī)應(yīng)用與軟件, 2014(6).
Related doctoral dissertations (1):
1. 周世兵. Research on methods for determining the optimal number of clusters in cluster analysis and their applications [D]. 江南大學(xué), 2011.
Related master's theses (5):
1. 宋建林. Research on improving the K-means clustering algorithm [D]. 安徽大學(xué), 2016.
2. 王帥宇. Applied research on the K-Means algorithm for user segmentation [D]. 北京理工大學(xué), 2015.
3. 董騏瑞. Improvement and implementation of the k-means clustering algorithm [D]. 吉林大學(xué), 2015.
4. 劉鳳芹. Research on improving the K-means clustering algorithm [D]. 山東師范大學(xué), 2013.
5. 吳曉蓉. Research on initial-center selection for the K-means clustering algorithm [D]. 湖南大學(xué), 2008.
Article ID: 2063796
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2063796.html