面向復(fù)雜數(shù)據(jù)的聚類算法研究
本文選題:聚類 切入點(diǎn):環(huán)境污染抽樣數(shù)據(jù) 出處:《蘭州大學(xué)》2016年博士論文
【摘要】:近年來,數(shù)據(jù)量急劇增長(zhǎng),數(shù)據(jù)源的種類日益增多,導(dǎo)致從復(fù)雜的數(shù)據(jù)中獲取有用的信息變得越來越困難。要想很好地利用這些數(shù)據(jù),就必須理解這些復(fù)雜的數(shù)據(jù),從中挖掘出其內(nèi)在的模式。聚類分析能根據(jù)數(shù)據(jù)間的相似性識(shí)別出數(shù)據(jù)集中的內(nèi)在模式。但很多聚類算法在劃分不同類型的數(shù)據(jù)集時(shí),都會(huì)遇到精確性不高或者執(zhí)行效率較低等問題,這就需要投入更多的精力去提高聚類算法的性能。本論文以更高效、更精確地對(duì)復(fù)雜數(shù)據(jù)進(jìn)行聚類為目的,針對(duì)三種不同類型的數(shù)據(jù),集中在聚類研究的三個(gè)方面,提出了四個(gè)聚類算法:EPC、MulSim、CLUB和SUM。EPC是一個(gè)根據(jù)污染特征將大氣污染抽樣數(shù)據(jù)進(jìn)行聚類的算法,它能提高CMB、PMF、UNMIX和PCA等源解析模型的精確性,并且和傳統(tǒng)算法算法相比更易于使用、更適合于聚類高維數(shù)據(jù);MulSim和CLUB是挖掘數(shù)據(jù)集中包含的任意形狀、任意密度以及任意規(guī)模的簇的兩個(gè)聚類算法,其中,MulSim基于單點(diǎn)與多點(diǎn)相似的策略進(jìn)行聚類,CLUB通過識(shí)別簇的密度主干進(jìn)行聚類;SUM是對(duì)圖數(shù)據(jù)中的頂點(diǎn)進(jìn)行聚類的算法,其基本原理是質(zhì)疑簇中的最大度頂點(diǎn)在聚類時(shí)對(duì)其他頂點(diǎn)的連接作用。(1)EPC EPC在第一步對(duì)數(shù)據(jù)進(jìn)行預(yù)處理后,迭代進(jìn)行第二步,每次迭代選擇第一個(gè)未標(biāo)記的數(shù)據(jù)點(diǎn)作為一個(gè)簇中心點(diǎn),然后根據(jù)本文提出的相似性函數(shù)和用戶給定的相似性閾值,把每個(gè)數(shù)據(jù)點(diǎn)分配到與它最相似的中心點(diǎn)所屬的簇,最后利用與k-Means相似的方法對(duì)簇進(jìn)行更新,形成最終的簇結(jié)構(gòu)。本文在實(shí)驗(yàn)部分通過人工數(shù)據(jù)集和真實(shí)數(shù)據(jù)集分別驗(yàn)證了EPC算法的有效性。結(jié)果表明,EPC算法不但能根據(jù)污染特征的相似性對(duì)環(huán)境污染抽樣數(shù)據(jù)進(jìn)行聚類,而且還能同時(shí)檢測(cè)出其中的異常點(diǎn)。(2)MulSim MulSim定義了一個(gè)能自動(dòng)適應(yīng)數(shù)據(jù)點(diǎn)密度變化的相似性函數(shù),若一個(gè)數(shù)據(jù)點(diǎn)同時(shí)與另一個(gè)數(shù)據(jù)點(diǎn)以及該點(diǎn)的鄰居相似,就認(rèn)為這兩個(gè)數(shù)據(jù)點(diǎn)屬于同一個(gè)簇。實(shí)驗(yàn)結(jié)果顯示,在測(cè)試的任意密度數(shù)據(jù)集、統(tǒng)一密度數(shù)據(jù)集、簇內(nèi)包含多個(gè)中心點(diǎn)的數(shù)據(jù)集、包含螺旋形簇的數(shù)據(jù)集、包含球狀簇的數(shù)據(jù)集、包含任意形狀簇的數(shù)據(jù)集以及多維數(shù)據(jù)集等各種類型的數(shù)據(jù)集上,MulSim的聚類質(zhì)量在多數(shù)情況下優(yōu)于六個(gè)對(duì)比算法。(3)CLUB CLUB首先基于互k最近鄰方法發(fā)現(xiàn)初始簇,接著將初始簇作為算法第二步的輸入,基于k最近鄰方法識(shí)別出簇的密度主干。然后,通過把無標(biāo)簽的數(shù)據(jù)點(diǎn)分配給密度比它大的最近鄰所在的簇以形成最終簇結(jié)構(gòu)。最后,從簇的內(nèi)部檢測(cè)出異常點(diǎn)。實(shí)驗(yàn)部分在九個(gè)包含任意形狀、任意密度、任意規(guī)模簇的二維數(shù)據(jù)集以及七個(gè)廣泛使用的多維數(shù)據(jù)集上,通過與三個(gè)經(jīng)典算法、三個(gè)新算法進(jìn)行比較,對(duì)CLUB的性能進(jìn)行了評(píng)價(jià)。而且,還將CLUB應(yīng)用于Olivetti Face數(shù)據(jù)集上,展示了其在人臉識(shí)別中的有效性。實(shí)驗(yàn)結(jié)果顯示,CLUB在大多數(shù)情況下優(yōu)于對(duì)比算法。(4)SUM SUM利用相鄰頂點(diǎn)間的公共鄰居個(gè)數(shù)和較小度頂點(diǎn)的度定義了一個(gè)相似度函數(shù)。在將相似的頂點(diǎn)放置到同一個(gè)簇中之后,SUM質(zhì)疑簇中的最大度頂點(diǎn)對(duì)其他頂點(diǎn)的連接作用,斷開簇中最大度頂點(diǎn)與其鄰居頂點(diǎn)的連線,將最大度頂點(diǎn)重新分配來獲得初始簇。然后,SUM將尚未標(biāo)記的點(diǎn)分配給初始簇后,調(diào)整邊界點(diǎn)以形成最終簇。通過與四個(gè)經(jīng)典的、兩個(gè)新的圖聚類算法在四個(gè)有真實(shí)簇結(jié)構(gòu)、四個(gè)無真實(shí)簇結(jié)構(gòu)的圖上的實(shí)驗(yàn)比較顯示,SUM能夠較精確地檢測(cè)出簇結(jié)構(gòu),并且結(jié)果優(yōu)于對(duì)比算法。四個(gè)算法的時(shí)間復(fù)雜度都接近于線性復(fù)雜度。所以,這四個(gè)算法均能以較高的精確性對(duì)其相應(yīng)特征的數(shù)據(jù)集高效地進(jìn)行聚類分析。
[Abstract]:In recent years, the explosive growth of data types of data sources is increasing, to become more and more difficult to obtain useful information from complex data. In order to make good use of these data, we must understand these complex data, dig out its inherent pattern. From the cluster analysis according to the data similarity identify internal mode data sets. But a lot of clustering algorithm in different data sets, will meet the accuracy is not high or low efficiency problems, which need to devote more energy to improve the performance of clustering algorithms. In this paper, more efficient and more accurate for complex data clustering for the purpose, for three different types of data, focused on the three aspects of clustering research, propose four clustering algorithms: EPC, MulSim, CLUB and SUM.EPC is a according to the pollution characteristics of air pollution will be sampled data The clustering algorithm, it can improve the accuracy of the model CMB, PMF, source apportionment of UNMIX and PCA, and compared with traditional algorithm algorithm is easier to use, more suitable for clustering high dimensional data mining; MulSim and CLUB are arbitrary data set contains the arbitrary density and two arbitrary scale clustering algorithm. The cluster, MulSim cluster based on single point and multi-point similar strategy, CLUB cluster by density cluster SUM is the main recognition; clustering of graph vertices in the algorithm, the basic principle is to question connection in the cluster with maximum degree vertex to other vertices (in clustering. 1) EPC EPC in the first step of data pre-processing, iterative second step, each iteration selects the first unlabeled data points as a center of the cluster, and then based on the similarity function and the user to set the similarity threshold Value, assign each data point to the center of the most similar and it belongs to the cluster, finally using the method similar to k-Means on the cluster update form cluster structure. The final part through experiments in this paper verified the effectiveness of EPC algorithm on artificial and real datasets respectively. The results show that the EPC algorithm not only can be carried out according to the clustering of environmental pollution sampling data similarity pollution characteristics, but also detect the abnormal points of them. (2) MulSim MulSim is defined to automatically adapt to changes in the data point density similarity function, if a point at the same time similar to another data point and the neighbor point the thought of these two data points belong to the same cluster. The experimental results show that the test data sets in arbitrary density, uniform density data set, cluster contains multiple center data sets, including spiral cluster data set, The globular clusters contain data sets, including arbitrary shape clustering data sets and multidimensional data sets such as various types of data sets, the quality of clustering is better than MulSim in most cases six contrast algorithm. (3) CLUB CLUB based on mutual k nearest neighbor method of initial clusters, then the second step algorithm as the initial cluster the input of the k nearest neighbor method to identify the main cluster based on density. Then, through the distribution of unlabeled data points to the nearest cluster where the density is larger than it to form the final cluster structure. Finally, from the internal point detect abnormal clusters. In the experimental section nine contains arbitrary shape and arbitrary density a two-dimensional data set, arbitrary scale clusters and seven multidimensional data sets are widely used, with three classic algorithms, three new algorithms are compared, the performance of CLUB is evaluated. And also the application of CLUB in Olivetti Face According to the set, demonstrating its effectiveness in face recognition. The experimental results show that CLUB is better than that in most cases the comparison algorithm. (4) SUM SUM with adjacent vertices of the public number of neighbors and small degree vertices of a similarity function. After similar to vertex placement in the same cluster, SUM questioned connection in a cluster of other vertices of maximum degree vertices, disconnect the cluster in the maximum degree vertex and its neighbor vertex connectivity, maximum degree vertex re distribution to obtain the initial cluster. Then, SUM will point to the initial cluster allocation has not been marked after adjusting the boundary point to form the final cluster. With four classic, two new graph clustering algorithm in the four true cluster structure shows four non real cluster experiments on graphs, SUM can accurately detect the cluster structure, and the result is better than the comparison algorithm. Four algorithms The time complexity is close to linear complexity. Therefore, these four algorithms can efficiently cluster analysis of their corresponding characteristic data sets with high accuracy.
【學(xué)位授予單位】:蘭州大學(xué)
【學(xué)位級(jí)別】:博士
【學(xué)位授予年份】:2016
【分類號(hào)】:TP311.13
【相似文獻(xiàn)】
相關(guān)期刊論文 前10條
1 ;數(shù)據(jù)集N鄽2[J];航空材料;1959年09期
2 江海洪 ,羅長(zhǎng)坤;首套中國(guó)數(shù)字化可視人體數(shù)據(jù)集在第三軍醫(yī)大學(xué)研制成功[J];中華醫(yī)學(xué)雜志;2003年09期
3 陳相穎;數(shù)據(jù)集記錄快速定位與篩選方法之探討[J];計(jì)量與測(cè)試技術(shù);2005年06期
4 張曉斌;魏永祥;韓德民;夏寅;李希平;原林;唐雷;王興海;;數(shù)字化耳鼻咽喉數(shù)據(jù)集的采集[J];中華耳鼻咽喉頭頸外科雜志;2005年06期
5 王宏鼎;唐世渭;董國(guó)田;;數(shù)據(jù)集成中數(shù)據(jù)集特征的檢測(cè)方法[J];中國(guó)金融電腦;2006年03期
6 張華;郁書好;;時(shí)空數(shù)據(jù)集的連接處理和優(yōu)化方法研究[J];皖西學(xué)院學(xué)報(bào);2006年02期
7 苗卿;單立新;裘昱;;信息熵在數(shù)據(jù)集分割中的應(yīng)用研究[J];電腦知識(shí)與技術(shù)(學(xué)術(shù)交流);2007年05期
8 陳德誠(chéng);丘平珠;唐炳莉;;廣西氣象數(shù)據(jù)集設(shè)計(jì)與制作[J];氣象研究與應(yīng)用;2007年04期
9 趙鳳英;王崇駿;陳世福;;用于不均衡數(shù)據(jù)集的挖掘方法[J];計(jì)算機(jī)科學(xué);2007年09期
10 劉密霞;張秋余;趙宏;余冬梅;;入侵檢測(cè)報(bào)警相關(guān)性及評(píng)測(cè)數(shù)據(jù)集研究[J];計(jì)算機(jī)應(yīng)用研究;2008年10期
相關(guān)會(huì)議論文 前10條
1 田捷;;三維醫(yī)學(xué)影像數(shù)據(jù)集處理的集成化平臺(tái)[A];2003年全國(guó)醫(yī)學(xué)影像技術(shù)學(xué)術(shù)會(huì)議論文匯編[C];2003年
2 范明;魏芳;;挖掘基本顯露模式用于分類[A];第二十一屆中國(guó)數(shù)據(jù)庫學(xué)術(shù)會(huì)議論文集(技術(shù)報(bào)告篇)[C];2004年
3 冷傳良;;飛機(jī)化銑成樣板劃線數(shù)據(jù)集設(shè)計(jì)方法探索[A];第十屆沈陽科學(xué)學(xué)術(shù)年會(huì)論文集(信息科學(xué)與工程技術(shù)分冊(cè))[C];2013年
4 孟燁;張鵬;宋大為;王雷;;信息檢索系統(tǒng)性能對(duì)數(shù)據(jù)集特性的依賴性分析[A];第十二屆全國(guó)人機(jī)語音通訊學(xué)術(shù)會(huì)議(NCMMSC'2013)論文集[C];2013年
5 段磊;唐常杰;左R,
本文編號(hào):1711149
本文鏈接:http://sikaile.net/shoufeilunwen/xxkjbs/1711149.html