Research and Application of Feature Screening Methods for Ultra-High-Dimensional Data

Published: 2018-12-05 20:11

[Abstract]: With the arrival of the big data era, ultra-high-dimensional data are common in fields such as meteorological forecasting, pattern recognition, and genetic research. In such data only a small number of covariates are associated with the response variable, so the model is sparse, and because the dimension is so high, traditional robust statistical methods and high-dimensional variable selection methods are no longer applicable. To analyze ultra-high-dimensional data well, the dimension must first be reduced. In recent years many researchers have proposed convenient variable screening methods for this setting. An effective and reasonable approach works in two steps: first, a fast and efficient screening procedure reduces the dimension to a suitable scale below the sample size while retaining all important variables; second, established methods are applied to the reduced data for variable selection. This thesis proposes two ultra-high-dimensional feature screening methods. For complex data with heteroscedasticity or heavy tails, it proposes a robust screening method based on interval conditional quantiles. For incomplete data in which the response variable is missing at random, it proposes a marginal correlation screening method based on inverse probability weighting. The main work of this master's thesis is as follows. Chapter 1 surveys the history and current state of variable screening for ultra-high-dimensional data and systematically reviews quantiles and missing data.
Chapter 2 proposes a robust interval conditional quantile feature screening method for complex ultra-high-dimensional data with heavy tails and outliers. Most existing work on conditional quantiles is carried out at a single quantile level, so the screening result depends on the quantile chosen in advance, and a small perturbation of that quantile can make the screening unstable. This thesis introduces the idea of global quantile regression, lets the quantile level range over an interval, and proposes an interval-based conditional quantile screening method whose screening criterion is more accurate. Theoretical proofs, simulation studies, and a real-data example show that the improved method is more stable. Chapter 3 proposes a feature screening method for ultra-high-dimensional data in which the response variable is missing at random (MAR). Existing work on feature screening focuses mainly on complete data, yet responses missing at random arise frequently in market research, social surveys, and medical research. For such data, a marginal screening procedure based on inverse probability weighting is proposed, and its effectiveness is likewise verified through theoretical proofs, numerical simulations, and a real-data example. Chapter 4 summarizes the two proposed feature screening methods and points out directions for further research.
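The screening step of the two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the thesis's interval-quantile or inverse-probability-weighted method, whose formulas the abstract does not give. It is a generic marginal-correlation ranking in the style of sure independence screening, run on simulated data. The function name `marginal_screen`, the simulated design, and the retained size `d = n / log(n)` are all assumptions made for illustration.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Keep the d covariates with the largest absolute marginal
    Pearson correlation with y (a generic screening utility)."""
    Xc = X - X.mean(axis=0)          # center each column
    yc = y - y.mean()                # center the response
    # Small epsilon guards against division by zero for constant columns.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    utility = np.abs(Xc.T @ yc) / denom
    keep = np.argsort(utility)[::-1][:d]   # indices of the top-d utilities
    return np.sort(keep), utility

# Simulated sparse model (assumed for illustration): p >> n, three active covariates.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, -3.0, 3.0]
y = X @ beta + rng.standard_normal(n)

# Reduce the dimension to a size below the sample size.
d = int(n / np.log(n))
selected, utility = marginal_screen(X, y, d)
```

In a second step, an established variable selection method (for example, a penalized regression) would be fit on `X[:, selected]` only. A robust or missing-data variant would replace the correlation utility with a different marginal measure.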
[Degree-granting institution]: Nanjing University of Information Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O212


Article ID: 2365405
