Research on Data-Driven Feature Selection Methods Based on Redundancy-Complementariness Dispersion and the Feature Envelope Frontier
Topics: data-driven + feature selection. Source: Huazhong University of Science and Technology, doctoral dissertation, 2016
[Abstract]: As society develops, data are becoming increasingly complex and high-dimensional. Feature selection algorithms, which are widely used for dimensionality reduction of big data, have become an important research direction for socio-economic and business decision-making in a big-data, data-driven setting. The choice of parameters in a feature selection method strongly affects the quality of the selected features and the re-representation of the data. The joint mutual information between a feature set S = {F1, ..., Fk} and the class C can be expanded as a sum of the interaction information between features and the class over different dimensions (orders), so the joint mutual information between the feature set and the class can be written as an expansion in interaction-information terms. From the perspective of (2012), determining the parameters amounts to choosing among feature selection methods. The classical methods, however, rely on parameters that are fixed a priori, such as the redundancy weight β in MIFS. How to find suitable, non-a-priori weights that compensate for the missing high-order interaction terms is therefore a major problem in feature selection.

Two frameworks for solving the parameter problem in feature selection are given. First, from a data-driven perspective, the parameters are treated as corrections for the bias caused by omitting high-order interaction information. Starting from a data-driven, mutual-information-based feature evaluation framework, the thesis analyses the redundancy-complementariness dispersion caused by the missing high-order information and introduces a correction factor, driven by the high-order information, that corrects the low-order redundancy-complementariness terms (that is, determines the parameters), so that features can be evaluated and ranked accurately. Second, taking into account the multiple evaluation indices used in feature selection, the diversity of index weights, and their bias across domains and time periods, a DEA-based feature selection framework is constructed. It exploits the data-driven nature of DEA so that feature evaluation and selection can account for the diversity of relationships among features and of evaluation criteria, while also adapting to changes in the data environment.

Under the first framework, the feature selection parameters are determined from the redundancy-complementariness dispersion caused by omitting high-order interaction information. The dispersion is examined in depth by viewing the missing high-order mutual information through its "projection" onto the low order: the dispersion that this projection induces among features at the low order is assessed, and the parameters of the low-order terms are determined accordingly. On this basis, the Redundancy-Complementariness Dispersion-based Feature Selection method (RCDFS) is proposed. Because existing statistical methods can make unpredictable errors when estimating high-order terms, the algorithm assigns, in a data-driven way, a coefficient (weight) to the second-order approximation of the redundancy-complementariness relationship between features, which compensates for the bias introduced by the missing high-order terms. It is proved that a feature evaluation criterion based on averaging is guaranteed to capture a lower bound on the high-order redundancy and complementarity, which lays the foundation for an effective data-driven way of integrating feature evaluation criteria. Since the "prior knowledge" about evaluation criteria and feature-association bias in a given setting is contained in the data from that setting, the second framework is used to build a DEA-based super-efficiency feature evaluation model for feature selection. For the specific data of a given domain, the model uses super-efficiency DEA to choose suitable parameters for the evaluation criteria and to construct the corresponding super-efficiency envelope frontier, from which features are evaluated and ranked. The corresponding MCSD algorithm for solving the model is given and its complexity is discussed. Experiments show that the classification results obtained with the proposed MCSD algorithm are significantly better than those of IG, ReliefF, CMIM and JMI in the vast majority of cases.

The rapid growth of road transport has brought a continuing increase in traffic accidents. Drivers' unsafe driving behavior causes some of the most serious accidents, so identifying and analysing abnormal driving behavior from dynamic monitoring data, especially the behaviors that require close monitoring, is of great significance. According to Wright et al. (2009) and Mo et al. (2014), any new vehicle trajectory can be approximated by a linear combination of training trajectories, so sparse reconstruction can be applied to trajectory recognition and behavior classification. A large number of redundant trajectory features seriously degrades the accuracy of the trajectory learning model, and the slow solving speed of sparse-reconstruction-based trajectory learning models further underlines the importance of feature selection in modelling and processing. Feature selection is therefore embedded in the trajectory recognition model based on l2-lp sparse reconstruction and implemented with the data-driven feature selection algorithms proposed above. A method for computing the sparse reconstruction coefficient vector under the lp (0 < p < 1) norm, Orthogonal Matching Pursuit-quasi-Newton (OMPN), is proposed: it first uses the greedy Orthogonal Matching Pursuit (OMP) algorithm to find an initial feasible solution and then uses a quasi-Newton method to search further for a sparse solution. Finally, the relationship that holds, under certain conditions, between a local optimum of the lp (0 < p < 1) norm sparsity problem and its exact solution is used to obtain an even sparser solution. The experimental results demonstrate the advantages of the proposed framework and methods. They also show that embedding feature selection gives better results than omitting it, which indicates that the proposed data-driven feature selection methods have theoretical significance and broad application prospects in traffic safety management.
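The effect of the fixed redundancy weight that the abstract cites for MIFS can be illustrated with a minimal sketch of the classical MIFS criterion, which greedily adds the feature f maximising I(f; C) - β Σ_{s∈S} I(f; s). This is not the RCDFS or MCSD algorithm of the thesis. The function names `mutual_information` and `mifs`, the plug-in estimator for discrete data, and the toy data set are all assumptions made for illustration.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete 1-D arrays."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)          # contingency counts
    joint /= joint.sum()                     # empirical joint distribution
    px = joint.sum(axis=1, keepdims=True)    # marginal of X, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, ny)
    nz = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mifs(X, y, k, beta=0.5):
    """Greedy MIFS: score(f) = I(f; C) - beta * sum over selected s of I(f; s)."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        def score(j):
            redundancy = sum(mutual_information(X[:, j], X[:, s]) for s in selected)
            return relevance[j] - beta * redundancy
        best = max(remaining, key=score)     # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: f0 is strongly relevant, f1 duplicates f0,
# f2 is weakly relevant but empirically independent of f0.
y  = np.array([0] * 8 + [1] * 8)
f0 = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0])
f1 = f0.copy()
f2 = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X = np.column_stack([f0, f1, f2])

print(mifs(X, y, k=2, beta=0.0))  # [0, 1]: redundancy ignored, duplicate kept
print(mifs(X, y, k=2, beta=1.0))  # [0, 2]: duplicate penalised
```

On this toy data the selected subset depends entirely on the hand-picked β, which is the kind of a priori parameter choice that the data-driven weighting described in the abstract is meant to replace.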
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree]: Doctorate
[Year conferred]: 2016
[Classification number]: TP311.13
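[Citation]: Zhang Yishi (張逸石). Research on Data-Driven Feature Selection Methods Based on Redundancy-Complementariness Dispersion and the Feature Envelope Frontier [D]. Huazhong University of Science and Technology, 2016.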