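Research on Dimensionality Reduction Algorithms Based on Reconstruction Information Preservation
Published: 2018-05-28 07:08
Topics: dimensionality reduction + feature extraction; Source: Shandong Normal University, master's thesis, 2017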
[Abstract]: With the continuing development of network and storage technology, more and more data sets are both large in volume and high in dimension. Such massive high-dimensional data carry richer information, but they also bring problems such as the curse of dimensionality and heavy computation, which pose new challenges for data analysis. How to describe high-dimensional data effectively and mine meaningful information from it has therefore become an urgent problem. Dimensionality reduction is one effective way to address this problem and is widely used in face recognition, bioinformatics, image retrieval and other fields. As the technology has developed, the requirements placed on dimensionality reduction algorithms have risen, and the quality of an algorithm directly affects the accuracy of information extraction and analysis. With the goal of improving the separability of high-dimensional data after dimensionality reduction, and taking the particular characteristics of the data sets into account, this thesis proposes two dimensionality reduction algorithms that preserve the reconstruction information of the data, and verifies and analyzes their accuracy and reliability on different data sets. The main work and contributions are summarized as follows:
1. A Neighborhood Preserving Embedding Algorithm based on Global Distance and Label Information (GLI-NPE) is proposed. Neighborhood preserving embedding (NPE) constructs its neighborhood graph with the traditional Euclidean distance; GLI-NPE adds to that distance formula a global factor that represents global distance and a function term that represents the class information of the data. The global factor smooths unevenly distributed samples, which makes NPE more robust on such data. The class information makes samples of the same class compact and pushes samples of different classes apart. By improving the quality of the selected neighbors and thereby optimizing the local neighborhoods of the data, the algorithm gives the reduced data better separability. Experimental results show that GLI-NPE effectively improves classification accuracy after dimensionality reduction.
2. For high-dimensional gene expression data, the aim is to reduce the dimension of the data while improving the separability of tumor data. After analyzing the respective limitations of sparse representation and nearest-neighbor representation, as well as the distinctive nature of classification in tumor data, the thesis proposes a feature extraction algorithm based on Discriminative Hybrid Structure Preserving Projections (DHSPP). DHSPP linearly combines the sparse representation and the nearest-neighbor representation into a hybrid representation, then uses the class information to split it into an intra-class hybrid representation and an inter-class hybrid representation, and constructs the objective function on the principle of maximizing inter-class distance while minimizing intra-class distance. In addition, because tumor data are mostly imbalanced, a balance adjustment factor is added when computing the intra-class distance in order to balance the majority and minority classes. Experimental results show that reducing the dimension of tumor expression data with DHSPP effectively improves the classification accuracy of the reduced tumor data.
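The abstract describes GLI-NPE only in words and does not give its distance formula, so the following Python sketch is an illustration of the general idea rather than the thesis's method. It assumes one plausible form of the adjusted distance: the Euclidean distance is divided by the geometric mean of each sample's average distance to all other samples (a stand-in for the "global factor"), then shrunk for same-class pairs and stretched for different-class pairs by a hypothetical parameter `alpha` (a stand-in for the "class information" term). The remaining steps follow standard NPE: reconstruction weights that sum to one, then a linear projection from a generalized eigenproblem. All function and parameter names are assumptions made for this example.

```python
import numpy as np

def adjusted_distances(X, y, alpha=0.5):
    """Pairwise distances with an ASSUMED global factor and label term."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))           # Euclidean distances, shape (n, n)
    n = len(X)
    mean_d = d.sum(axis=1) / (n - 1)               # mean distance of each sample to the others
    d_global = d / np.sqrt(np.outer(mean_d, mean_d))  # assumed global rescaling
    same = y[:, None] == y[None, :]
    # Assumed label term: shrink same-class distances, stretch different-class ones.
    return d_global * np.where(same, 1.0 - alpha, 1.0 + alpha)

def reconstruction_weights(X, D, k, reg=1e-3):
    """Standard NPE/LLE weights: each sample is rebuilt from its k neighbors under D."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        nbrs = order[order != i][:k]               # k nearest neighbors, excluding i itself
        Z = X[nbrs] - X[i]                         # neighbors shifted so x_i is the origin
        G = Z @ Z.T                                # local Gram matrix, shape (k, k)
        G += reg * (np.trace(G) + 1e-12) * np.eye(k)  # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce the sum-to-one constraint
    return W

def npe_projection(X, W, n_components, reg=1e-6):
    """Solve Xc^T M Xc a = lambda Xc^T Xc a and keep the smallest eigenvalues."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                        # centering is a convenience choice here
    I_W = np.eye(n) - W
    A = Xc.T @ (I_W.T @ I_W) @ Xc
    B = Xc.T @ Xc
    B += reg * np.trace(B) / d * np.eye(d)         # keep B positive definite
    evals_B, evecs_B = np.linalg.eigh(B)
    B_inv_sqrt = evecs_B @ np.diag(evals_B ** -0.5) @ evecs_B.T
    S = B_inv_sqrt @ A @ B_inv_sqrt                # whitened, symmetric problem
    S = (S + S.T) / 2
    _, evecs = np.linalg.eigh(S)                   # eigenvalues come back in ascending order
    return B_inv_sqrt @ evecs[:, :n_components]    # projection matrix, shape (d, n_components)
```

A typical call sequence would be `D = adjusted_distances(X, y)`, `W = reconstruction_weights(X, D, k=5)`, `P = npe_projection(X, W, n_components=2)`, after which `(X - X.mean(axis=0)) @ P` gives the low-dimensional embedding. Setting `alpha=0` removes the label term, which makes it easy to compare the supervised and unsupervised neighbor choices on the same data.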
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP301.6
Article ID: 1945781
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1945781.html