Research on Active Extreme Learning Machine Algorithms for Imbalanced Data Distributions
Published: 2018-11-05 14:03
[Abstract]: In recent years, with the rapid development of data acquisition and storage technology, almost every industry has accumulated massive amounts of data, and how to analyse these data has become a core problem for researchers in machine learning and data mining. For example, labelling the classes of such massive data in order to build a classification model greatly increases the cost in manpower, material resources and time, and active learning is an effective tool for reducing this cost. After years of research, many effective active learning algorithms have been proposed, but they all ignore an important question: whether these algorithms remain effective when the class distribution of the samples is imbalanced. This thesis therefore studies how to preserve the efficiency and performance of active learning on class-imbalanced data, that is, how to improve active learning algorithms so that their classification performance is as good as possible under an imbalanced data distribution. The main research content covers the following two aspects.

1) When active learning is performed on imbalanced data, the decision boundary easily becomes biased towards the majority class, which can cause active learning to fail. To address this, a sampling technique is adopted as the balance control strategy of the learning process: after surveying several existing sampling algorithms, a borderline oversampling algorithm is proposed and combined with active learning. Because the extreme learning machine (ELM) has strong generalisation ability and fast training speed, it is used as the base classifier to accelerate the active learning process. The performance of the active learning algorithm with this balance control strategy is verified on 12 benchmark datasets. The results show that active learning is indeed affected in imbalanced scenarios, and that the active learning method equipped with the sampling technique performs better (a minimal sketch of this pipeline follows the abstract).

2) To achieve faster training, online learning is introduced and an online weighted extreme learning machine algorithm, OS-W-ELM, is proposed. Cost-sensitive learning is adopted as the balance control strategy of the learning process and is combined with active learning, again with the extreme learning machine as the base classifier. On the same 12 benchmark datasets, the performance of the AL-OS-W-ELM, AL-OS-ELM and RS-OS-W-ELM algorithms is compared, and the running time of AL-OS-W-ELM and AL-OS-ELM is compared with that of the active learning algorithm that uses the sampling technique. The results show that, in imbalanced scenarios, the active learning method that adopts online learning and cost-sensitive learning performs better (a sketch of the online weighted update also follows the abstract).
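The following is a minimal NumPy sketch of the kind of pipeline described in point 1: an ELM base classifier, a Borderline-SMOTE-style oversampler used as the balance control strategy, and margin-based uncertainty sampling for the active learning queries. The names (`ELM`, `borderline_oversample`, `active_learning`), the sigmoid hidden layer, the hidden-layer size and the query rule are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


class ELM:
    """Single-hidden-layer extreme learning machine for labels in {-1, +1}."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.seed = seed

    def _hidden(self, X):
        # Sigmoid hidden layer; the random input weights stay fixed after fit().
        return 1.0 / (1.0 + np.exp(-(X @ self.W_in + self.b)))

    def fit(self, X, y):
        r = np.random.default_rng(self.seed)
        self.W_in = r.normal(size=(X.shape[1], self.n_hidden))
        self.b = r.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights by least squares (Moore-Penrose pseudo-inverse).
        self.beta = np.linalg.pinv(H) @ y
        return self

    def decision_function(self, X):
        return self._hidden(X) @ self.beta

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)


def borderline_oversample(X, y, k=5):
    """Oversample the minority class (+1) near the boundary, Borderline-SMOTE style."""
    minority = X[y == 1]
    majority_mask = (y == -1)
    n_new = int(majority_mask.sum() - len(minority))
    if n_new <= 0 or len(minority) < 2:
        return X, y
    # "Danger" points: minority samples whose neighbourhood is mostly majority.
    danger = []
    for i, x in enumerate(minority):
        nn = np.argsort(np.linalg.norm(X - x, axis=1))[1:k + 1]
        if majority_mask[nn].sum() >= k / 2:
            danger.append(i)
    danger = danger or list(range(len(minority)))
    synth = []
    for _ in range(n_new):
        i, j = rng.choice(danger), rng.integers(len(minority))
        synth.append(minority[i] + rng.random() * (minority[j] - minority[i]))
    return np.vstack([X, np.asarray(synth)]), np.concatenate([y, np.ones(n_new)])


def active_learning(X_lab, y_lab, X_pool, y_oracle, n_queries=30):
    """Margin-based uncertainty sampling with an ELM retrained on rebalanced labels."""
    for _ in range(n_queries):
        clf = ELM().fit(*borderline_oversample(X_lab, y_lab))
        # Query the pool point closest to the current decision boundary.
        q = int(np.argmin(np.abs(clf.decision_function(X_pool))))
        X_lab = np.vstack([X_lab, X_pool[q]])
        y_lab = np.append(y_lab, y_oracle[q])        # the oracle supplies the label
        X_pool, y_oracle = np.delete(X_pool, q, axis=0), np.delete(y_oracle, q)
    return ELM().fit(*borderline_oversample(X_lab, y_lab))
```

The oversampling step is what keeps each retrained ELM from drifting towards the majority class as queries accumulate, which is the kind of effect the thesis evaluates on the 12 benchmark datasets.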
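Point 2 replaces the retrain-from-scratch step with an online, cost-sensitive update. Below is a minimal sketch in the spirit of OS-W-ELM: the output weights are maintained by weighted recursive least squares over the ELM hidden layer, with per-class weights acting as the cost-sensitive balance control. The class name `OSWELM`, the `partial_fit`/`class_weight` interface and the regularised initialisation are assumptions for illustration, not the thesis code.

```python
import numpy as np


class OSWELM:
    """Online weighted ELM: weighted recursive least squares on the hidden layer."""

    def __init__(self, n_features, n_hidden=50, reg=1e-3, seed=0):
        r = np.random.default_rng(seed)
        self.W_in = r.normal(size=(n_features, n_hidden))
        self.b = r.normal(size=n_hidden)
        # P tracks (H^T W H + reg*I)^(-1) over all chunks seen so far.
        self.P = np.eye(n_hidden) / reg
        self.beta = np.zeros(n_hidden)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W_in + self.b)))

    def partial_fit(self, X, y, class_weight):
        """Absorb one chunk; y in {-1, +1}, class_weight maps label -> cost weight
        (e.g. the inverse class frequency, so minority-class errors cost more)."""
        H = self._hidden(X)
        w = np.array([class_weight[int(t)] for t in y])
        # Woodbury update of P, then the weighted RLS correction of beta.
        K = np.linalg.inv(np.diag(1.0 / w) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (w * (y - H @ self.beta))
        return self

    def decision_function(self, X):
        return self._hidden(X) @ self.beta

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)
```

In an active learning loop, each newly labelled chunk can be passed to `partial_fit` instead of retraining from scratch, which is the kind of running-time saving the abstract compares against the oversampling-based algorithm.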
【Degree-granting institution】: Jiangsu University of Science and Technology (江蘇科技大學(xué))
【Degree level】: Master's
【Year awarded】: 2017
【CLC classification number】: TP18
Article ID: 2312301
Link to this article: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2312301.html