

Research on Clustering Algorithms for High-Dimensional Data

Published: 2018-03-03 17:10

  Topic: subspace clustering | Focus: high-dimensional data | Source: Shenzhen University, 2017 master's thesis | Type: degree thesis


【Abstract】: In recent years, the rapid development of Internet technology has sharply increased both the scale and the dimensionality of data, bringing with it the curse of dimensionality and the problem of density sparsity. High-dimensional data typically contain many redundant or irrelevant features and noise, which pose great challenges for clustering analysis. Research shows that the cluster structure of high-dimensional data usually lies in some subspace of the data rather than in the full data space. To handle high-dimensional data, researchers at home and abroad have proposed many subspace clustering methods. Among them, soft subspace clustering is an important research topic: it assigns a weight to each feature of a sample and determines the subspace structure of each cluster through the features with larger weights. However, any single feature in high-dimensional data is weak, and it is difficult to discover cluster structure through one weak feature; methods that weight individual features also perform poorly on data with thousands of features. Many high-dimensional data sets are aggregated from observations of different aspects, so that features from different aspects can be grouped, and different feature groups have different importance in different clusters. Researchers have proposed the FG-k-means method, which assigns weights to feature groups of high-dimensional data: it divides the features into several groups and introduces two levels of weights, on feature groups and on individual features, obtaining a large performance improvement. However, FG-k-means cannot group features automatically; the grouping must be supplied from human prior knowledge, and for many high-dimensional data sets the grouping information is unknown in advance. Addressing these problems, this thesis takes high-dimensional data as its research object; the main work consists of the following two parts.

(1) A hidden feature group learning model (LFGL) for subspace clustering is proposed. Previous methods cannot group features automatically during clustering and must rely on prior knowledge, yet for many high-dimensional data sets the grouping information is unknown. The LFGL model first constructs a feature grouping model (FGM), then embeds it into a subspace clustering algorithm to form an optimization problem, and finally solves that problem with optimization algorithms under the constraints of the FGM. Experiments on real data sets such as images and gene data show that, compared with previous clustering methods, LFGL not only groups features automatically but also achieves better clustering results.

(2) Dimensionality reduction and clustering analysis based on a deep denoising sparse autoencoder (DDSAE) are proposed. High-dimensional data suffer from the curse of dimensionality and density sparsity: as the dimensionality increases, the performance of clustering methods drops markedly, and ultra-high-dimensional data may even exhaust the memory of a single machine. Exploiting the nonlinear representational power of autoencoders, this work introduces an L2 penalty into the autoencoder to prevent overfitting, adds noise to the input data to improve robustness, uses cross-entropy as the loss function, and stacks several such encoders into a deep denoising sparse autoencoder. The DDSAE learns low-dimensional, abstract essential features from high-dimensional data, and the resulting low-dimensional feature vectors are then clustered with the LFGL model of Chapter 3. Comparison with PCA and LLE shows that the method performs better on dimensionality reduction and clustering of high-dimensional data. Moreover, the clustering results of DDSAE are clearly better than those of LFGL alone, which further demonstrates the effectiveness of the method.
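The two-level weighting idea behind FG-k-means can be illustrated with a small distance computation. This is a minimal sketch, not the thesis's actual algorithm: the function name, the toy grouping, and the weight values are all hypothetical, and only the distance term is shown, not the full iterative clustering procedure.

```python
import numpy as np

def two_level_weighted_distance(x, center, groups, group_w, feat_w):
    """Distance between a sample and a cluster center in which each
    squared-difference term is scaled by both the weight of its feature
    group and the weight of the individual feature, in the spirit of
    FG-k-means-style two-level weighting."""
    d = 0.0
    for g, idx in enumerate(groups):          # groups: list of index arrays
        diff = x[idx] - center[idx]
        d += group_w[g] * np.sum(feat_w[idx] * diff ** 2)
    return d

# Toy example: 6 features split into two groups of 3.
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
group_w = np.array([0.7, 0.3])                # group-level weights
feat_w = np.full(6, 1.0 / 3)                  # feature-level weights (sum to 1 per group)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
center = np.zeros(6)
print(round(two_level_weighted_distance(x, center, groups, group_w, feat_w), 3))
# → 10.967
```

In a full clustering algorithm both weight levels would themselves be updated during the iterations, so that clusters can down-weight whole feature groups at once rather than thousands of individually weak features.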
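The building block of the DDSAE described above (corrupt the input, encode, reconstruct, train with cross-entropy plus an L2 penalty) can be sketched as a single denoising autoencoder layer. This is a minimal illustration under stated assumptions, not the thesis's implementation: the class name, tied-weight decoder, masking-noise scheme, and all hyperparameters are hypothetical, and the sparsity term of the full DDSAE is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    """One denoising autoencoder layer: corrupt the input, encode to a
    lower-dimensional code, reconstruct, and train with binary
    cross-entropy plus an L2 weight penalty. Layers like this can be
    stacked to form a deep denoising autoencoder."""

    def __init__(self, n_in, n_hidden, noise=0.3, l2=1e-4, lr=0.5):
        self.W = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias
        self.noise, self.l2, self.lr = noise, l2, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def step(self, x):
        # Masking noise: randomly zero a fraction of the inputs.
        x_tilde = x * (rng.random(x.shape) > self.noise)
        h = self.encode(x_tilde)
        y = sigmoid(h @ self.W.T + self.c)   # tied-weight decoder
        # Gradient of cross-entropy w.r.t. the decoder pre-activation.
        dy = y - x
        dh = (dy @ self.W) * h * (1 - h)     # back-propagate to encoder
        gW = x_tilde.T @ dh + dy.T @ h + self.l2 * self.W
        self.W -= self.lr * gW / len(x)
        self.b -= self.lr * dh.mean(axis=0)
        self.c -= self.lr * dy.mean(axis=0)
        eps = 1e-9                           # cross-entropy reconstruction loss
        return -np.mean(np.sum(x * np.log(y + eps)
                               + (1 - x) * np.log(1 - y + eps), axis=1))

# Toy data in [0, 1]: reduce 8-D binary patterns to a 3-D code.
X = rng.random((64, 8)).round()
dae = DenoisingAutoencoder(8, 3)
losses = [dae.step(X) for _ in range(200)]
codes = dae.encode(X)   # low-dimensional features, usable for clustering
print(codes.shape)      # → (64, 3)
```

Stacking would proceed greedily: train one layer, feed its codes to the next layer, and finally cluster the deepest codes, which is the role the LFGL model plays in the thesis.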
【Degree-granting institution】: Shenzhen University
【Degree level】: Master's
【Year conferred】: 2017
【Classification number】: TP311.13





