Research on Fast Dimensionality Reduction and Clustering Algorithms for High-Dimensional Data Streams

Published: 2019-05-18 08:09

[Abstract]: The explosive growth of data has made it increasingly difficult to find valuable information in data and turn it into organized knowledge, which gave rise to data mining. Cluster analysis, one of the main research methods in data mining, is widely used in many fields. With the continuing development of information technology, the data stream has emerged as a new data type and has gradually become mainstream, so research on data stream clustering algorithms has become both popular and meaningful. Clustering a high-dimensional data stream involves two parts: dimensionality reduction and clustering. This thesis addresses shortcomings in existing algorithms for each part, proposes improved algorithms, and demonstrates their advantages experimentally.

Building on previous work, the thesis first addresses two problems of subspace dimensionality reduction algorithms for high-dimensional data streams: they cannot automatically adjust the reduction result as the data stream changes, and they need to scan the data stream multiple times. It proposes a structure-tree-based adaptive subspace dimensionality reduction algorithm for high-dimensional data streams. The algorithm uses an improved relative entropy to find the relevant dimensions of each region, builds the corresponding subspace, and performs clustering in that subspace, which ensures that different regions correspond to different subspaces. Using relative entropy to find a region's relevant dimensions is simpler and more natural than Sun Yufen's GSCDS algorithm. A structure tree stores the information about the partitioning process and, combined with the idea of backtracking, makes the subspace clustering algorithm for high-dimensional data streams adaptive.
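The idea of using relative entropy to pick a region's relevant dimensions can be illustrated with a minimal sketch. It assumes plain Kullback-Leibler divergence between each dimension's histogram and the uniform distribution, not the thesis's improved measure, which the abstract does not specify. The function names, the bin count, and the threshold are hypothetical choices made for illustration only.

```python
import math
import random


def relative_entropy_to_uniform(values, lo, hi, bins=10):
    """KL divergence D(P || U) between the histogram of `values` on
    [lo, hi] and the uniform distribution over the same bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that values on or outside the boundary fall in an end bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    n = len(values)
    u = 1.0 / bins
    return sum((c / n) * math.log((c / n) / u) for c in counts if c > 0)


def relevant_dimensions(points, lo, hi, bins=10, threshold=0.5):
    """Return the dimensions whose distribution is far from uniform.

    Assumption: a dimension along which the region's points are
    concentrated (high relative entropy) is treated as relevant.
    The threshold value is hypothetical.
    """
    dims = len(points[0])
    return [
        d for d in range(dims)
        if relative_entropy_to_uniform([p[d] for p in points], lo, hi, bins)
        > threshold
    ]


# Synthetic region: dimensions 0 and 2 are concentrated, dimension 1 is uniform.
random.seed(0)
region = [
    (random.gauss(0.5, 0.02), random.random(), random.gauss(0.2, 0.02))
    for _ in range(500)
]
print(relevant_dimensions(region, 0.0, 1.0))
```

In this sketch a dimension with evenly spread values scores close to zero, while a dimension whose values cluster in a few bins scores close to log(bins), so a single threshold separates the two cases.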
This adaptivity avoids having to rerun the subspace algorithm every time new data arrives, and a decay factor prevents old data from having excessive influence on the clustering result. Experimental results show that the algorithm achieves high clustering quality with low time complexity.

Applying a grid-based clustering algorithm to the dimensionality-reduced data retains the efficiency and strong adaptability of grid methods, but grid partitioning leads to low accuracy at cluster edges, which degrades clustering quality. To address the low cluster-edge accuracy of grid-based data stream clustering algorithms, and their need to scan the grid multiple times to complete clustering, the thesis proposes an improved data stream clustering algorithm. The improvement has two main aspects. First, the initial clustering stage works from the inside out and from point to surface, completing clustering in a single scan of the grid and removing the inefficiency caused by repeated grid scans in the original algorithm. Second, the algorithm finds maximal density-connected sets to separate noise points from useful points in edge regions as far as possible, which solves the problem of missing edge points in the original algorithm. Finally, experiments show that the improved algorithm is effective at improving cluster-edge accuracy and adapts well to the distribution of the data.
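The combination of a decay factor with grid-based clustering can be sketched as follows. This is a generic illustration, not the thesis's algorithm: it assumes exponential decay of per-cell density and grows clusters outward from the densest cells through face-adjacent dense cells, visiting each cell once. The class name `DecayedGrid`, its parameters, and the density threshold are hypothetical, and the sketch does not attempt the edge-point refinement described above.

```python
from collections import deque


class DecayedGrid:
    """Grid of cells whose densities fade exponentially over time."""

    def __init__(self, cell_size, decay=0.98):
        self.cell_size = cell_size
        self.decay = decay          # per-time-unit decay factor in (0, 1)
        self.cells = {}             # cell key -> (density, last update time)

    def _key(self, point):
        return tuple(int(x // self.cell_size) for x in point)

    def density(self, key, now):
        """Density of a cell at time `now`, after applying decay lazily."""
        d, t = self.cells.get(key, (0.0, now))
        return d * self.decay ** (now - t)

    def insert(self, point, now):
        key = self._key(point)
        self.cells[key] = (self.density(key, now) + 1.0, now)

    @staticmethod
    def _neighbors(cell):
        # Face-adjacent cells: one step along a single axis.
        for axis in range(len(cell)):
            for step in (-1, 1):
                yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]

    def clusters(self, now, min_density):
        """Group connected dense cells, starting from the densest cell
        and expanding outward; every dense cell is visited once."""
        dense = {k for k in self.cells if self.density(k, now) >= min_density}
        seen, result = set(), []
        for start in sorted(dense, key=lambda k: -self.density(k, now)):
            if start in seen:
                continue
            seen.add(start)
            queue, members = deque([start]), []
            while queue:
                cell = queue.popleft()
                members.append(cell)
                for nb in self._neighbors(cell):
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
            result.append(sorted(members))
        return result


grid = DecayedGrid(cell_size=1.0, decay=0.5)
stream = [(0.2, 0.3), (0.4, 0.1), (1.5, 0.5), (1.2, 0.7), (9.1, 9.2)]
for t, point in enumerate(stream):
    grid.insert(point, now=t)

# The oldest cell (0, 0) has decayed below 0.2 and drops out.
print(grid.clusters(now=4, min_density=0.2))  # [[(9, 9)], [(1, 0)]]
# With a lower threshold it is still dense and joins its neighbor (1, 0).
print(grid.clusters(now=4, min_density=0.1))  # [[(9, 9)], [(0, 0), (1, 0)]]
```

Applying the decay lazily, only when a cell is read or updated, means that inserting a point touches a single cell rather than the whole grid, which suits a stream that can be scanned only once.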
[Degree-granting institution]: Changsha University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13



Article ID: 2479822


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2479822.html
