Research on Subspace-Based Clustering Algorithms

Published: 2018-05-29 08:59

Topic: high-dimensional data + cluster analysis; Source: Jiangnan University, master's thesis, 2017

[Abstract]: With the rapid development of life sciences, mobile communications, e-commerce, social networks, and related fields, large volumes of high-dimensional data have emerged, and how to cluster such data effectively has become both a research focus and a difficult problem. Traditional cluster analysis usually takes all attributes of the data objects into account. High-dimensional data, however, often contain many irrelevant or redundant attributes, which push the distances between sample points toward one another, so that the chance of finding clusters in the full feature space is close to zero. Subspace clustering addresses this problem by searching for clusters in different subspaces of the same data set. According to how attributes are weighted, existing algorithms fall into two groups: hard subspace clustering and soft subspace clustering. This thesis studies subspace clustering from both perspectives. The main work is as follows:

(1) The hard subspace clustering algorithm SUBCLU searches bottom-up for clusters in maximal-interest subspaces and repeatedly generates intermediate clusters along the way, which consumes a great deal of time. To address this, the thesis proposes an improved algorithm, BDFS-SUBCLU, which uses a depth-first search with backtracking to mine the clusters in the maximal-interest subspaces. This strategy avoids generating intermediate clusters and reduces the time complexity of the algorithm. BDFS-SUBCLU also adds a constraint on core points within a subspace, which to some extent prevents adjacent clusters from being merged into one because of particular data points. Experiments on synthetic and real data sets show that BDFS-SUBCLU improves on SUBCLU in both efficiency and accuracy.

(2) Soft subspace clustering algorithms built on the k-means framework are mostly sensitive to the initial cluster centers; poorly chosen initial centers cause them to fall into a local optimum prematurely. To address this, the thesis proposes a corresponding improvement: on top of the original algorithm, a feedback step checks whether the algorithm has become trapped in a local optimum. When it has, the current best solution is taken as the clustering result, and the feedback check is repeated until no better clustering result can be found. A comparison group is also added to raise the likelihood of escaping the local optimum. Experiments on real UCI data sets show that the improved FSC and EWKM algorithms both achieve higher accuracy.

(3) The open-source Chinese word segmenter mmseg4j is used to segment Chinese text; the text is then converted, using the vector space model, into a numeric matrix that the algorithms can process; finally, the soft subspace clustering algorithms proposed in the thesis are applied to cluster it.
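The sensitivity of k-means-style soft subspace clustering to its initial centers can be illustrated with a small sketch. The code below is not the thesis's feedback mechanism, whose details the abstract does not give. It assumes a textbook-style entropy-weighted k-means loop (per-cluster attribute weights computed as a softmax of negative within-cluster dispersion) and swaps in a plain multi-restart strategy that keeps the run with the lowest objective. The function names `ewkm`, `objective`, and `best_of_restarts`, the parameter `gamma`, and the toy data are all hypothetical choices made for illustration.

```python
import math
import random


def objective(points, labels, centers, weights, gamma):
    """Weighted within-cluster dispersion plus gamma times the weight-entropy term."""
    cost = 0.0
    for p, c in zip(points, labels):
        cost += sum(w * (x - z) ** 2 for w, x, z in zip(weights[c], p, centers[c]))
    entropy = sum(w * math.log(w) for row in weights for w in row if w > 0)
    return cost + gamma * entropy


def ewkm(points, k, gamma=1.0, iters=50, seed=0):
    """One run of an entropy-weighted k-means style loop (illustrative sketch only)."""
    rng = random.Random(seed)
    dims = len(points[0])
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    weights = [[1.0 / dims] * dims for _ in range(k)]   # uniform weights per cluster
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center under each cluster's own attribute weights.
        new_labels = [
            min(range(k), key=lambda c: sum(
                weights[c][j] * (p[j] - centers[c][j]) ** 2 for j in range(dims)))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # empty cluster: keep its previous center and weights
            centers[c] = [sum(p[j] for p in members) / len(members) for j in range(dims)]
            # Per-attribute dispersion inside the cluster.
            disp = [sum((p[j] - centers[c][j]) ** 2 for p in members) for j in range(dims)]
            # Softmax of -dispersion / gamma; subtracting the minimum keeps exp() stable.
            low = min(disp)
            raw = [math.exp(-(d - low) / gamma) for d in disp]
            total = sum(raw)
            weights[c] = [r / total for r in raw]
    return labels, centers, weights, objective(points, labels, centers, weights, gamma)


def best_of_restarts(points, k, restarts=10, gamma=1.0):
    """Assumed stand-in for escaping poor local optima: keep the lowest-objective run."""
    runs = (ewkm(points, k, gamma=gamma, seed=s) for s in range(restarts))
    return min(runs, key=lambda run: run[3])


if __name__ == "__main__":
    # Toy data: attribute 0 separates two groups, attribute 1 is irrelevant noise.
    rng = random.Random(1)
    data = [(rng.gauss(0, 0.1), rng.uniform(-5, 5)) for _ in range(30)]
    data += [(rng.gauss(3, 0.1), rng.uniform(-5, 5)) for _ in range(30)]
    labels, centers, weights, cost = best_of_restarts(data, k=2)
    print("attribute weights per cluster:", weights)
    print("objective of best run:", cost)
```

Different seeds can converge to different objective values on the same data, which is the behavior the restart wrapper is meant to expose. A feedback-based scheme like the one the abstract describes would presumably replace `best_of_restarts` with a check that continues only while better results keep being found.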
[Degree-granting institution]: Jiangnan University
[Degree]: Master's
[Year awarded]: 2017
[Classification number]: TP311.13



Article ID: 1950288


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1950288.html
