Research on Logistic Regression Classification Learning Algorithms for the Class Imbalance Problem
Published: 2018-04-24 16:39
Keywords: logistic regression + class imbalance. Source: Xinyang Normal University, 2017 master's thesis
【Abstract】: The class imbalance problem is one of the most actively studied problems in pattern recognition and machine learning. It is characterized by some classes having far fewer instances than others. In practical applications, correctly identifying minority-class instances is often more valuable than correctly identifying majority-class instances. In medical diagnosis, for example, only a very small fraction of people are cancer patients, and correctly identifying those patients is of great importance. However, logistic regression, a classical statistical classification method, implicitly assumes that the classes in a data set contain comparable numbers of instances and optimizes for high overall accuracy. As a result, the learned model often fails to capture the characteristics of minority-class instances and misclassifies them. To address this problem, this thesis proposes two logistic regression learning algorithms for class-imbalanced classification. (1) A new logistic regression learning algorithm for class imbalance. Logistic regression estimates its parameters by maximum likelihood, which makes it difficult for the model to capture minority-class characteristics. To address this, the thesis constructs MLER (Maximum Likelihood Evaluation and Recall), a measure based on the maximum likelihood function and recall. Unlike the maximum likelihood objective, MLER considers both the accuracy and the recall of the model, thereby ensuring performance on all classes. Based on MLER, the thesis proposes LRIL (Logistic Regression for Imbalanced Learning), a new logistic regression algorithm for the class imbalance problem; LRIL learns its parameters under the MLER objective using Newton's method. Experimental results show that, while retaining the high accuracy of logistic regression, LRIL effectively improves recall, F-measure, and G-mean, and it also shows clear advantages over other state-of-the-art methods. (2) Targeting the skewed class distribution at the heart of the class imbalance problem, the thesis proposes ILKLR (Imbalanced Learning based on k-means and Logistic Regression), a hybrid algorithm combining k-means and logistic regression. Unlike conventional logistic regression, ILKLR uses k-means to partition the majority class into several subclusters and assigns each subcluster a new class label, making the training set closer to linearly separable. Experimental results show that this data preprocessing method outperforms standard logistic regression, under-sampling logistic regression, and over-sampling logistic regression on recall, G-mean, and F-measure.
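The abstract does not give code for MLER or LRIL, but the core idea of augmenting the log-likelihood with a recall term can be sketched. The sketch below is a hypothetical reconstruction, not the thesis's actual algorithm: it replaces (non-differentiable) recall with a smooth surrogate, the mean predicted probability over minority instances, and uses plain gradient ascent rather than the Newton updates the thesis employs. The names `mler_objective`, `fit_lril`, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def mler_objective(w, Xb, y, lam=1.0):
    """Negative of a hypothetical MLER-style objective:
    log-likelihood plus a smooth surrogate for minority-class recall."""
    p = sigmoid(Xb @ w)
    eps = 1e-12
    log_lik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    soft_recall = p[y == 1].mean()  # mean predicted prob. on true positives
    return -(log_lik + lam * len(y) * soft_recall)

def fit_lril(X, y, lam=1.0, lr=0.1, iters=500):
    """Gradient-ascent sketch (the thesis itself uses Newton's method)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    n_pos = max((y == 1).sum(), 1)
    for _ in range(iters):
        p = sigmoid(Xb @ w)
        grad_ll = Xb.T @ (y - p)  # gradient of the log-likelihood
        # Gradient of the soft-recall surrogate: d p_i/dw = p_i(1-p_i) x_i.
        pos = y == 1
        grad_rec = Xb[pos].T @ (p[pos] * (1 - p[pos])) / n_pos
        w += lr / len(y) * (grad_ll + lam * len(y) * grad_rec)
    return w
```

On a toy imbalanced data set, the recall term pushes the decision boundary away from the minority class, trading a little overall accuracy for higher recall as `lam` grows.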
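The ILKLR preprocessing step described above — cluster the majority class, give each subcluster its own label, then train one multi-class logistic model — can be sketched in plain NumPy. This is a minimal sketch under stated assumptions: the majority class is labeled 0 and the minority 1, the subcluster count `k` is a free choice, and multinomial (softmax) regression stands in for the logistic regression stage; all function names are illustrative, and the thesis's exact relabeling procedure may differ.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns a subcluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def softmax_fit(X, y, n_classes, lr=0.5, iters=1000):
    """Multinomial logistic regression by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(iters):
        Z = Xb @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W += lr / len(X) * (Xb.T @ (Y - P))
    return W

def ilklr_fit(X, y, k=3):
    """ILKLR-style sketch: split the majority class (y == 0) into k
    k-means subclusters relabeled 1..k; the minority class keeps label 0."""
    y_multi = np.zeros(len(y), dtype=int)
    y_multi[y == 0] = kmeans(X[y == 0], k) + 1
    return softmax_fit(X, y_multi, k + 1)

def ilklr_predict(W, X):
    # Map multi-class predictions back: label 0 -> minority (1), else majority (0).
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (np.argmax(Xb @ W, axis=1) == 0).astype(int)
```

The design rationale matches the abstract: when the majority class is multimodal, one linear boundary cannot isolate the minority, but per-subcluster boundaries often can, which is why the relabeled training set becomes closer to linearly separable.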
【Degree-granting institution】: Xinyang Normal University
【Degree level】: Master's
【Year of award】: 2017
【CLC number】: TP181