Research on Machine Learning Methods That Exploit Unlabeled Data

Published: 2018-04-23 10:26

Topic: machine learning + semi-supervised learning. Source: Nanjing University, 2017 master's thesis

[Abstract]: Machine learning needs labeled data to train predictive models, and obtaining labels usually requires human effort, which makes them expensive. In many practical applications, unlabeled data can be collected easily and in large quantities, so how to exploit cheap unlabeled data has long been an active research topic in machine learning. Two approaches to using unlabeled data have emerged. The first is semi-supervised learning, which automatically uses unlabeled data alongside labeled data to improve learning performance. Most such methods do improve performance, but they all rest on underlying model assumptions, and performance may degrade when those assumptions deviate from the data distribution. The second is crowdsourcing, which supplies labels for the data at relatively low cost, so that the unlabeled data can then be used accurately and the learning risk reduced.

This thesis studies semi-supervised learning and crowdsourcing and makes the following contributions. First, co-training, an important paradigm in semi-supervised learning, is easily affected by insufficient views, and a new weighted co-training algorithm is proposed to address this. When the views are insufficient, the co-training process produces samples whose labels are inconsistent with the optimal classifier. The algorithm detects potentially inconsistent samples and lowers their weights to reduce their influence on training. Experimental results show that it generalizes better and is more robust than the standard co-training algorithm.

Second, because the labels obtained for a crowdsourcing task depend on the task's difficulty, a new task assignment algorithm is proposed. The algorithm estimates the difficulty of a subset of tasks, uses them as a training set to learn a difficulty prediction model, and divides the tasks into two classes, easy and hard. Easy tasks are labeled through crowdsourcing, while for hard tasks experts are hired to provide high-quality labels. Experimental results show that the algorithm improves label quality while reducing labeling cost.

In addition, the thesis studies model reuse with unlabeled data. In this scenario, the user needs to combine several pre-trained models that cannot be modified, and a new multi-view model reuse algorithm is proposed for this problem. The algorithm estimates the reliability of each pre-trained model through belief propagation, guides this estimation with multi-view consistency on the unlabeled data, and then combines the pre-trained models in an ensemble weighted by the estimated reliabilities. Experimental results show that the method significantly improves classification accuracy.
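The reliability-weighted combination described at the end of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: instead of belief propagation, it uses a simpler fixed-point iteration (an assumption made here purely for illustration) in which each frozen model's reliability is its agreement rate with the current weighted consensus on unlabeled samples. The names `weighted_vote` and `estimate_reliability` and the `model_outputs` layout are hypothetical, and the separate views are abstracted away by assuming every model has already produced a label for every sample.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the models' votes."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


def estimate_reliability(model_outputs, n_iters=10):
    """Estimate one reliability weight per model using unlabeled data only.

    model_outputs[m][i] is the label that frozen model m assigns to
    unlabeled sample i (hypothetical layout). No ground-truth labels are used.
    """
    n_models = len(model_outputs)
    n_samples = len(model_outputs[0])
    weights = [1.0 / n_models] * n_models  # start from uniform trust
    for _ in range(n_iters):
        # Current consensus label for every unlabeled sample.
        consensus = [
            weighted_vote([outputs[i] for outputs in model_outputs], weights)
            for i in range(n_samples)
        ]
        # A model's reliability is its agreement rate with the consensus.
        agreement = [
            sum(o == c for o, c in zip(outputs, consensus)) / n_samples
            for outputs in model_outputs
        ]
        total = sum(agreement)
        weights = [a / total for a in agreement]  # normalize to sum to 1
    return weights


# Three frozen models labeling six unlabeled samples (made-up toy data).
model_outputs = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
]
weights = estimate_reliability(model_outputs)
combined = [
    weighted_vote([outputs[i] for outputs in model_outputs], weights)
    for i in range(len(model_outputs[0]))
]
```

In this toy data the third model disagrees with the other two on most samples, so it ends up with the smallest weight. A consensus-based estimate like this can be misled when most models share the same errors, which is one reason a method might prefer a more structured estimate.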
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181

Article ID: 1791566

Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1791566.html
