Research and Application of Feature Screening Methods for Ultra-High-Dimensional Data
[Abstract]: With the advent of the big data era, ultra-high-dimensional data are frequently encountered in meteorological prediction, pattern recognition, genetic research, and other fields. In such data, only a small number of covariates are correlated with the response variable, so the underlying model is sparse. Traditional statistical analysis methods and variable selection methods designed for high-dimensional data are no longer applicable, so the dimension must first be reduced before the data can be analyzed well. In recent years, many scholars have proposed convenient variable screening methods for ultra-high-dimensional data. One effective and reasonable strategy proceeds in two steps. First, a fast and efficient screening procedure reduces the ultra-high-dimensional data to a moderate size below the sample size while retaining all important variables. Second, a mature variable selection method for high-dimensional data is applied to the reduced data. This thesis proposes two feature screening methods for ultra-high-dimensional data. For complex ultra-high-dimensional data with heteroscedasticity and heavy tails, a robust feature screening method based on interval conditional quantiles is proposed. For incomplete ultra-high-dimensional data in which the response variable is missing at random, a feature screening method based on an inverse-probability-weighted marginal correlation measure is proposed. The main work of this thesis is as follows. Chapter 1 summarizes the history and current state of variable selection for ultra-high-dimensional data and systematically reviews quantile methods and missing data.
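The first step of the two-step strategy described above can be illustrated with a minimal sketch. It assumes the simplest possible screening utility, the absolute marginal Pearson correlation between each covariate and the response, and it assumes a retained size of n / log(n), a commonly used choice that lies below the sample size. Neither choice is taken from the thesis. The function names `abs_corr`, `marginal_screen`, and `simulate` are hypothetical, and the data are simulated.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def marginal_screen(columns, y, keep):
    """Return the indices of the `keep` covariates most correlated with y."""
    scores = [abs_corr(col, y) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order[:keep]


def simulate(n=100, p=500, seed=0):
    """Sparse linear model (assumed for illustration): only covariates 0, 1, 2 matter."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [
        2 * columns[0][i] - 2 * columns[1][i] + 2 * columns[2][i] + rng.gauss(0, 0.5)
        for i in range(n)
    ]
    return columns, y


columns, y = simulate()
keep = int(len(y) / math.log(len(y)))  # assumed retained size, below n
selected = marginal_screen(columns, y, keep)
print(len(selected), "covariates retained out of", len(columns))
```

The second step would then apply an established variable selection method, for example a penalized regression, to the retained covariates only. That step is omitted here to keep the sketch free of third-party dependencies.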
Chapter 2 proposes a robust feature screening method based on interval conditional quantiles for complex ultra-high-dimensional data with heavy tails and outliers. Most existing studies of conditional quantile screening work at a single quantile level, so the selected variables depend on the quantile level fixed in advance, and a small perturbation of that level can make the selection unstable. Drawing on the idea of global quantile regression, this thesis proposes a conditional quantile screening method defined over an interval of quantile levels, which makes the screening criterion more accurate. Simulation studies and real-data examples show that the improved method is more stable. Chapter 3 proposes a feature screening method for data whose response variable is missing at random (MAR). Existing work on feature screening focuses mainly on complete data. However, responses missing at random often arise in market research, social research, and medical research. For such data, a marginal screening procedure based on inverse probability weighting is proposed, and its validity is verified by theory, numerical simulation, and a real-data example. Chapter 4 summarizes the two feature screening methods and points out directions for further study.
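The inverse probability weighting idea of Chapter 3 can be sketched as follows. The abstract does not give the exact marginal measure used in the thesis, so this sketch assumes a simple one: an inverse-probability-weighted estimate of |cov(X_j, Y)| / sd(X_j), with weights normalized to sum to one. It also assumes the response probabilities are known, whereas in practice they would have to be estimated, for instance from a model of the missingness indicator on the covariates. The names `ipw_marginal_scores` and `simulate_mar` are hypothetical, and the data are simulated.

```python
import math
import random


def ipw_marginal_scores(columns, y, observed, prob):
    """IPW estimate of |cov(X_j, Y)| / sd(X_j) for every covariate.

    y[i] is read only where observed[i] is True; prob[i] is the (assumed
    known) probability that the response of unit i is observed.
    """
    n = len(y)
    w = [1.0 / prob[i] if observed[i] else 0.0 for i in range(n)]
    w_sum = sum(w)
    if w_sum == 0:
        raise ValueError("no observed responses")
    idx = [i for i in range(n) if observed[i]]
    mu_y = sum(w[i] * y[i] for i in idx) / w_sum  # weighted mean of the response
    scores = []
    for col in columns:
        mean_x = sum(col) / n  # covariates are fully observed
        sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in col) / n)
        if sd_x == 0:
            scores.append(0.0)
            continue
        cov = sum(w[i] * (col[i] - mean_x) * (y[i] - mu_y) for i in idx) / w_sum
        scores.append(abs(cov) / sd_x)
    return scores


def simulate_mar(n=400, p=300, seed=0):
    """Sparse linear model whose response is missing at random given covariate 0."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y, observed, prob = [], [], []
    for i in range(n):
        full = 2 * columns[0][i] - 2 * columns[1][i] + 2 * columns[2][i] + rng.gauss(0, 0.5)
        pi = 1.0 / (1.0 + math.exp(-(1.0 + 0.5 * columns[0][i])))
        seen = rng.random() < pi
        y.append(full if seen else None)  # None marks a missing response
        observed.append(seen)
        prob.append(pi)
    return columns, y, observed, prob


columns, y, observed, prob = simulate_mar()
scores = ipw_marginal_scores(columns, y, observed, prob)
keep = int(len(y) / math.log(len(y)))  # assumed retained size, below n
selected = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:keep]
print(len(selected), "covariates retained;", sum(observed), "responses observed")
```

Weighting each observed response by the inverse of its observation probability compensates for units that are under-represented among the complete cases, which is why the missingness here is allowed to depend on a covariate that also drives the response.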
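[Degree-granting institution]: Nanjing University of Information Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O212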
Article ID: 2365405
Article link: http://sikaile.net/kejilunwen/yysx/2365405.html