

Research on Ensemble Learning Algorithms Based on Feature Extraction

Published: 2018-07-04 10:42

  Topic: ensemble learning + feature extraction. Reference: Shandong Normal University, master's thesis, 2017


[Abstract]: Improving the generalization ability of learning systems has long been a focus of machine learning research. A single classifier has unavoidable limitations, so improvements in its classification performance eventually hit a bottleneck. Ensemble learning, a newer machine learning paradigm, uses several individual classifiers to predict the same problem; the final classification is decided jointly by the individual learners, whose outputs are combined according to some rule. Ensemble learning lets the classifiers complement one another, greatly improves the generalization ability and classification performance of the classification system, and is widely used in biomedicine, information science, and other fields. As Internet technology penetrates every area of social life, the data to be processed are becoming more complex: imbalanced data, high-dimensional data, noisy data, and other such data types are common. Traditional ensemble learning methods perform well on well-formed data but have limited effect on complex data, so it is particularly important to integrate data processing methods into ensemble learning. Feature extraction is one of the important tools of data analysis and processing and is widely used for dimensionality reduction and for removing noise and redundancy. Building on an in-depth study of ensemble learning algorithms, this thesis combines feature extraction and other data processing algorithms with ensemble learning and proposes improved ensemble learning algorithms, as follows. Imbalanced data usually cause a classifier to perform poorly on minority-class samples. To reduce the imbalance ratio of a data set, the SMOTE oversampling algorithm can be used to preprocess the data. This thesis uses independent component analysis (ICA) to remove noise from the data and applies SMOTE to balance the data, so that the processed data are better suited to ensemble learning algorithms.
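The SMOTE balancing step mentioned above can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names `smote_oversample` and `balance`, the default `k=5`, and the binary-label assumption are all hypothetical choices made for illustration. The ICA denoising step and the Bagging classifier are not shown; in practice they would likely come from a library such as scikit-learn (`FastICA`, `BaggingClassifier`), and the abstract does not say in which order ICA and SMOTE are applied.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (SMOTE-style).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances inside the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)            # seed samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.

    Assumption: y is binary, so every label other than minority_label
    belongs to the single majority class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum()) - len(X_min)
    if n_new <= 0:
        return X, y
    X_syn = smote_oversample(X_min, n_new, k=k, seed=seed)
    y_syn = np.full(n_new, minority_label)
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class; this is why denoising the data first (as the thesis does with ICA) could plausibly matter, since interpolating between noisy samples would propagate the noise.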
The experimental results show that the proposed method significantly improves the classification performance of the ensemble learning algorithm Bagging on imbalanced data. Different types of data each have their own organization and structural information, and their attributes are interrelated. Analysis shows that the feature attributes of web spam data sets are both high-dimensional and highly correlated. To address the high dimensionality of, and the correlation between, the content features and link features of spam pages, this thesis, building on an in-depth study of spam page feature attributes, applies principal component analysis (PCA) to groups of correlated attributes rather than to the feature set as a whole. This reduces the dimensionality while preserving, to some extent, the original attribute structure of the data set. The experimental results show that the proposed method performs well when applied to web spam classification.
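The idea of running PCA per group of correlated attributes, rather than once over all attributes, can be sketched as follows. This is an illustrative sketch only: the abstract does not say how the attribute groups are formed or how many components are kept per group, so the `groups` and `n_components` arguments below are assumptions supplied by the caller, and the function names are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def grouped_pca(X, groups, n_components):
    """Apply PCA separately to each group of columns and concatenate.

    groups       : list of column-index lists (assumed to be chosen so that
                   attributes within a group are correlated, e.g. content
                   features in one group and link features in another)
    n_components : number of components to keep for each group
    """
    X = np.asarray(X, dtype=float)
    blocks = []
    for cols, k in zip(groups, n_components):
        Xg = X[:, cols]
        mean, comps = pca_fit(Xg, k)
        blocks.append((Xg - mean) @ comps.T)  # project onto the group's own axes
    return np.hstack(blocks)
```

For example, `grouped_pca(X, groups=[[0, 1, 2], [3, 4, 5]], n_components=[2, 1])` reduces six attributes to three while keeping the two groups in separate output columns, which is one way to read the abstract's claim that grouping preserves some of the original attribute structure.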
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181



Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2095809.html

