Research and Implementation of Clustering Methods for High-Dimensional Categorical Data
Published: 2018-04-17 02:36
Topic keywords: categorical data + subspace clustering; Source: Donghua University, 2017 master's thesis
[Abstract]: Cluster analysis is an unsupervised machine learning method that partitions unorganized data into clusters according to given rules, so that each cluster consists of highly similar records. This greatly simplifies subsequent data analysis, and clustering is widely used in network services, geography, biology, trade, and other fields. As data sources and collection technologies have developed, however, the dimensionality and complexity of the data to be analyzed keep growing, and traditional clustering algorithms fail to produce good results on such data sets. Soft subspace clustering, a research hotspot in high-dimensional data clustering, has therefore attracted increasing attention. For categorical data, most existing soft subspace clustering algorithms are extensions of the k-modes algorithm: both the similarity between records and the weights of attributes (also called features) depend on the chosen cluster centers (modes), so the quality of the mode selection directly affects the final clustering quality. In addition, existing soft subspace clustering algorithms do not distinguish missing values from complete data during clustering, which also considerably affects the final result.

For high-dimensional, incomplete categorical data, this thesis applies the CLOPE algorithm, which clusters by the height-to-width ratio of cluster histograms, to soft subspace clustering and proposes a new soft subspace clustering algorithm. First, a missing-data handling method based on rough sets is proposed to treat the missing values in the data set, and attributes are weighted by their average mutual information. Second, because the clustering quality of CLOPE depends on the order in which the data are input, a "shuffle model" that puts the data into a completely random order is proposed to minimize the influence of input order on the final clustering quality. Finally, the algorithm is implemented in Scala on the Spark platform so that it can be used to cluster large-scale data.

Real data sets from UCI are used in four groups of experiments, which verify (1) the effectiveness of the shuffle model and the attribute weighting method, (2) the effectiveness of the missing-data handling method, (3) the effectiveness of the proposed soft subspace algorithm, and (4) its scalability with data size. The results show that the proposed algorithm (in the version without the missing-data handling method) achieves clearly better clustering quality than CLOPE. Compared with most-frequent-value imputation and with leaving missing values untreated, the advantage of the proposed missing-data handling method grows as the missing rate increases. Compared with two other typical soft subspace clustering algorithms for categorical data, the proposed algorithm has a clear advantage in both clustering quality and running time.
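To make the height-to-width criterion and the input-order problem concrete, the following is a minimal sketch of the standard CLOPE profit function and its first allocation pass, written in plain Python rather than the Scala/Spark implementation the thesis describes. It is not the thesis's algorithm: attribute weights, rough-set handling of missing values, and CLOPE's later refinement passes are omitted because the abstract does not specify them. The function names, the repulsion default `r=2.0`, and the `seed` parameter are assumptions chosen for illustration. Each record is assumed to be a non-empty tuple of `attribute=value` items.

```python
import random
from collections import Counter


def cluster_stats(transactions):
    """Return (S, W, N): total item occurrences, distinct items, record count."""
    counts = Counter(item for t in transactions for item in t)
    return sum(counts.values()), len(counts), len(transactions)


def profit(clusters, r=2.0):
    """Standard CLOPE profit: sum_i (S_i / W_i**r) * N_i, divided by sum_i N_i.

    S_i / W_i is the histogram height of cluster i, so a larger value
    means a taller, narrower histogram (more item overlap in the cluster).
    """
    numerator, total = 0.0, 0
    for cluster in clusters:
        s, w, n = cluster_stats(cluster)
        if w == 0:  # empty cluster (or only empty records): contributes nothing
            continue
        numerator += s / (w ** r) * n
        total += n
    return numerator / total if total else 0.0


def clope_first_pass(transactions, r=2.0):
    """Greedy allocation pass only: each record joins the existing cluster,
    or opens the new cluster, that gives the highest overall profit.
    The result depends on the order of `transactions`."""
    clusters = []
    for t in transactions:
        scored = []
        for i in range(len(clusters) + 1):
            trial = [list(c) for c in clusters] + [[]]
            trial[i].append(t)
            scored.append((profit(trial, r), i))
        _, best = max(scored, key=lambda pair: pair[0])
        if best == len(clusters):
            clusters.append([t])
        else:
            clusters[best].append(t)
    return clusters


def shuffled(transactions, seed=None):
    """Return a uniformly random permutation of the records (hypothetical
    stand-in for the 'shuffle model'; the thesis's exact scheme is not given)."""
    rng = random.Random(seed)
    out = list(transactions)
    rng.shuffle(out)
    return out


# Small illustrative data set (invented for this sketch).
data = [("a", "b"), ("a", "b", "c"), ("a", "c", "d"), ("d", "e"), ("d", "e", "f")]
grouping_1 = [data[:3], data[3:]]
grouping_2 = [data[:2], data[2:]]
clusters = clope_first_pass(shuffled(data, seed=0))
```

Running `clope_first_pass` on several differently seeded permutations and comparing the resulting `profit` values is one simple way to observe how much the input order matters on a given data set. The quadratic recomputation of `profit` keeps the sketch short; a practical implementation would update the cluster statistics incrementally.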
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181; TP311.13
Article ID: 1761714
Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1761714.html