Design of a Multidimensional Clustering Mining Algorithm Based on Grid-Density Partitioning
Topic: clustering algorithm. Entry point: grid. Source: Xi'an University of Finance and Economics (西安财经学院), master's thesis, 2014. Document type: degree thesis.
[Abstract]: Cluster analysis is an important part of data mining, and the clustering algorithm is its core: it determines the quality of the analysis results. Improving clustering efficiency while keeping the algorithm stable and effective, and thereby reducing the cost and burden on users, is a worthwhile research direction.

Traditional clustering algorithms place high demands on hardware resources and take a long time on massive data sets, so this thesis proposes a new clustering algorithm based on grids and density. Grid-based clustering is generally fast and efficient, but its clustering quality is limited; density-based clustering can find clusters of arbitrary shape, but its time complexity is high on high-dimensional data. Because the two approaches are complementary, partitioning the sample space with a combined grid-density strategy can greatly improve clustering efficiency. The algorithm proceeds in three stages. First, a grid is created and the data space is given an initial grid partition. Second, the sample space is partitioned: using the grid density threshold, the grid cells are divided into a high-density region and a low-density region; the high-density cells are sorted by density, the densest cell is located, and the nearest surrounding low-density cells are used to find the first high-density cluster; the points of that cluster are removed, the remaining high-density cells are sorted again, and the procedure repeats until the final partition of the space is formed. Finally, the center of gravity of each sub-cluster is computed and neighboring sub-clusters are merged to form new centers of gravity; merging continues until the number of clusters equals the given number, which yields the final clustering result.

The thesis first describes the algorithm theoretically and argues that its design is sound. It then reports an empirical analysis on several groups of data randomly generated in Matlab, which shows that the algorithm significantly reduces running time while its between-group sum of squared deviations differs little from that of the classical K-means algorithm.
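The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, which the abstract does not give in detail. The function name `grid_density_cluster`, the `bins` parameter, the default threshold (mean point count over non-empty cells), the growth rule (flood fill through adjacent high-density cells, so that low-density cells act as the cluster boundary), and the handling of low-density points (attached to the nearest sub-cluster center) are all assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import combinations, product


def centroid(points):
    """Center of gravity (arithmetic mean) of a list of equal-length points."""
    return [sum(axis) / len(points) for axis in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def grid_density_cluster(X, k, bins=10, threshold=None):
    """Return one label per point in X; at most k clusters whenever
    at least k sub-clusters are found (hypothetical helper)."""
    dims = len(X[0])
    lo = [min(p[d] for p in X) for d in range(dims)]
    hi = [max(p[d] for p in X) for d in range(dims)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]
    # Stage 1: map every point to an equal-width grid cell.
    cells = [
        tuple(min(int((p[d] - lo[d]) / span[d] * bins), bins - 1)
              for d in range(dims))
        for p in X
    ]
    density = Counter(cells)
    # Stage 2a: split cells into high and low density.
    # Assumed default threshold: mean point count over non-empty cells.
    if threshold is None:
        threshold = sum(density.values()) / len(density)
    dense = {c for c, n in density.items() if n >= threshold}
    if not dense:
        raise ValueError("threshold leaves no high-density cells")
    # Stage 2b: seed at the densest unassigned cell, then grow the cluster
    # through adjacent high-density cells (assumed growth rule).
    # Note: the neighbor list has 3**dims - 1 entries.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    cell_cluster, n_sub = {}, 0
    for seed in sorted(dense, key=density.get, reverse=True):
        if seed in cell_cluster:
            continue
        cell_cluster[seed] = n_sub
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cell, off))
                if nb in dense and nb not in cell_cluster:
                    cell_cluster[nb] = n_sub
                    queue.append(nb)
        n_sub += 1
    labels = [cell_cluster.get(c, -1) for c in cells]

    def centers(current, ids):
        return {j: centroid([p for p, l in zip(X, current) if l == j])
                for j in ids}

    # Assumption: low-density points join the nearest sub-cluster center.
    cents = centers(labels, range(n_sub))
    labels = [l if l >= 0 else min(cents, key=lambda j: sq_dist(p, cents[j]))
              for p, l in zip(X, labels)]
    # Stage 3: merge the two closest centers until k clusters remain.
    active = set(labels)
    while len(active) > k:
        cents = centers(labels, active)
        a, b = min(combinations(sorted(active), 2),
                   key=lambda ab: sq_dist(cents[ab[0]], cents[ab[1]]))
        labels = [a if l == b else l for l in labels]
        active.remove(b)
    remap = {old: new for new, old in enumerate(sorted(active))}
    return [remap[l] for l in labels]
```

Called on three tight, well-separated groups of points with `k=2`, the sketch finds three sub-clusters and then merges the two whose centers are closest. Counting points per cell is linear in the number of points, which is where a grid-based method gets its speed; the cost of the neighbor search grows quickly with the number of dimensions, so a real implementation would likely restrict it.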
[Degree-granting institution]: Xi'an University of Finance and Economics
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: C81