Research on Improved K-means Clustering Algorithms

Published: 2018-03-07 18:21

Topic: cluster analysis; Focus: K-means algorithm; Source: Anhui University, master's thesis, 2015; Document type: degree thesis

[Abstract]: Rapid advances in information technology and the rise of the Web have made the acquisition and storage of data increasingly automated, fast, and intelligent. Data mining emerged in response to these massive, irregular data resources, and cluster analysis is one of its important research branches. Cluster analysis is an unsupervised, exploratory classification technique: with no prior knowledge, it partitions an unlabeled dataset according to the similarity between data objects, producing a set of distinct clusters. It is now applied in many fields, including statistics, e-commerce, Web analysis, biomedicine, and marketing analysis.

K-means is a classic partition-based clustering algorithm. It selects initial cluster centers, assigns the data to clusters, and then adjusts each center to the mean of the cluster generated around it. Over repeated iterations it aims to maximize similarity within clusters and minimize similarity between clusters. K-means is simple in principle and easy to implement, and it offers good scalability and time complexity on large datasets. It nevertheless has several shortcomings. It is sensitive to the choice of initial cluster centers, and a poor choice can introduce large errors into the clustering result. Its final result is often a local optimum rather than the global optimum. In addition, the number of clusters k must be given in advance.

Drawing on adaptive feature weighting and genetic algorithms, this thesis addresses some of these shortcomings of traditional K-means, keeps the clustering result from becoming trapped in a local optimum, and improves the accuracy and stability of the algorithm. To address the inflexibility of fixed feature weights and the strong dependence on the choice of initial centers, attribute weights are adjusted on the principle that a more important attribute receives a larger weight, which makes the relative importance of each attribute explicit. Without a prespecified k, the algorithm uses the density of the data objects to select several representative objects from the high-density set as initial cluster centers, and obtains the optimal k by comparing values of the criterion function. During iteration, it updates the feature weights so that objects within a cluster are as similar as possible and objects in different clusters are as dissimilar as possible.

The genetic algorithm is then combined with adaptive weighting and applied to K-means: building on the attribute weights, the global search capability of the genetic algorithm is used to obtain better cluster centers, which K-means then refines. This approach effectively reduces the dependence of K-means on the initial centers and improves clustering quality. After being tested on experimental datasets, the algorithm is also applied to image segmentation, one of the application areas of clustering, and its segmentation results are compared. The two improved algorithms are validated on standard datasets and analyzed in terms of accuracy, number of iterations, and cluster centers. Comparison with traditional K-means confirms the efficiency of the improved K-means clustering algorithms.
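The classic K-means procedure summarized in the abstract (choose initial centers, assign each object to its nearest center, move each center to the mean of its cluster, repeat) can be sketched in a few lines of Python. This is a minimal illustration of the traditional algorithm only, not of the thesis's improved methods. The function names `kmeans` and `sse`, the `init` and `seed` parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (illustrative sketch)."""
    if init is not None:
        centers = [tuple(c) for c in init]
    else:
        # Random initial centers: the choice K-means is known to be sensitive to.
        centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical toy data: two small, well-separated groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(data, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(round(sse(data, centers, labels), 4))
```

Running `kmeans` with different `init` values or seeds is a simple way to observe the sensitivity to initial centers described in the abstract: runs may stop at different local optima, and `sse` gives one criterion value for comparing them.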
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP311.13

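[Citing literature]

Related journal articles (top 1)

1 Hu Tao; Wang Tao; Shi Yongshuai; Ju Mingyuan; Application of an improved K-means algorithm to smart electricity-usage data analysis [J]; Information Technology and Informatization; 2016, No. 09

Related master's theses (top 1)

1 Zheng Weina; Research on feature clustering algorithms in image classification [D]; Yanshan University; 2016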

