Research on Software Fault Prediction Methods Based on Transfer Learning and PU Learning

Published: 2019-01-03 14:10

[Abstract]: With the continuing development of artificial intelligence, machine learning has been applied to software fault prediction. Traditional machine-learning-based fault prediction needs a large number of labeled samples to build a model, but in practice labeled fault data are usually obtained through manual testing, which is time-consuming, laborious, and costly. To reduce the demand for labeled samples in supervised settings, this thesis studies positive and unlabeled learning (PU learning) and transfer learning, and proposes predicting target fault samples in the PU scenario by transferring knowledge from cross-company, cross-project positive and unlabeled fault data. The specific work is as follows.

(1) An instance transfer algorithm based on random forests in the PU scenario (the POSTRF algorithm). Drawing on the idea of Bayesian cross-class transfer, the algorithm treats the samples to be predicted as the target-domain data set and the cross-company, cross-project software fault samples as the source-domain data set. It samples the source domain with replacement to train multiple PU random decision trees, computes sample weights from the AUC (Area Under the ROC Curve) values the trees obtain on target-domain data and from the membership of each sampled set, transfers the source samples whose distribution is similar to the target domain, and combines them with the target-domain data into a PU data set. A model built with the POSC4.5 algorithm then predicts the software fault samples of the target domain. In detail, the algorithm first samples the source-domain data set with replacement at proportion bagSize to obtain M sampled sets and trains M PU random decision trees. It randomly draws 75% of the target-domain samples as a test set for the M trees and takes each tree's AUC value as that tree's weight. The samples in each sampled set are weighted by the tree weight, the per-set weights are merged into a final weight for each sample, and the highest-weighted samples are transferred at transfer ratio r, which completes the instance transfer. A PU data set is then constructed from the transferred samples and the target-domain data set under the selected-completely-at-random assumption. The uncertain information gain of each attribute is computed from the number of positive samples, the number of unlabeled samples, and the prior probability of the positive class. The attribute with the largest uncertain information gain is chosen as the branch node, and the tree is grown recursively from the top down to predict the target-domain fault samples.

(2) Experiments on the POSTRF algorithm. Eight software fault data sets from the NASA database are used. The 0kc3 and cm1 data sets serve in turn as the target-domain data set, with the remaining data sets as the source domain, and the proposed algorithm is compared with the POSC4.5 algorithm. The results show that, by transferring instances from the auxiliary sets, POSTRF improves classification performance on the 0kc3 and cm1 target sets: the AUC value rises by about 3%-12% and the fault prediction rate (PD) by about 5%. The proposed POSTRF algorithm therefore achieves comparable or better prediction performance on target-domain fault samples than the traditional PU learning algorithm by transferring knowledge from cross-project and cross-company software fault data.
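The instance-transfer step in part (1) can be illustrated with a small sketch. This is not the thesis's implementation: the function names are hypothetical, the PU random decision trees are left out (their per-tree AUC values are passed in), and the abstract does not say how per-set weights are merged, so the sketch assumes a simple sum in which each sample gets the tree's AUC once for every sampled set it appears in.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns 0.5 if either class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_bags(n_source, m_trees, bag_size, rng):
    """Draw m_trees sampled sets of source indices, with replacement.

    bag_size is the proportion of the source set drawn for each tree.
    """
    k = max(1, int(bag_size * n_source))
    return [[rng.randrange(n_source) for _ in range(k)] for _ in range(m_trees)]


def select_transfer_instances(bags, tree_aucs, n_source, ratio):
    """Weight source samples by the AUC of the trees that used them.

    Assumption: weights are merged by summation, and a sample counts once
    per sampled set even if it was drawn several times into that set.
    Returns (indices of the top `ratio` share of samples, all weights).
    """
    weights = [0.0] * n_source
    for bag, tree_auc in zip(bags, tree_aucs):
        for i in set(bag):
            weights[i] += tree_auc
    k = int(ratio * n_source)
    ranked = sorted(range(n_source), key=lambda i: weights[i], reverse=True)
    return ranked[:k], weights
```

In use, each sampled set from `bootstrap_bags` would train one tree, `auc` would score that tree on the 75% target-domain test split, and `select_transfer_instances` would return the source indices to merge with the target data before building the final POSC4.5 model.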
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.53

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2399484.html
