Research on Machine Learning Methods for Exploiting Unlabeled Data
Published: 2018-04-23 10:26
Topics: machine learning + semi-supervised learning; Source: Nanjing University, master's thesis, 2017
[Abstract]: Machine learning requires labeled data to train predictive models, and obtaining labels usually involves human effort, which makes them expensive. In many practical applications, unlabeled data can be obtained cheaply and in large quantities, so exploiting unlabeled data has long been a research focus in machine learning. Two broad approaches have emerged. The first is semi-supervised learning, which automatically uses unlabeled data alongside labeled data to improve learning performance; although most such methods do improve performance, they rest on underlying model assumptions, and when those assumptions deviate from the true data distribution, performance can degrade. The second is crowdsourcing, which obtains labels for the data at low cost so that the formerly unlabeled data can be exploited precisely, reducing the learning risk. This thesis studies semi-supervised learning and crowdsourcing and makes the following contributions.

First, co-training, an important paradigm in semi-supervised learning, is vulnerable to insufficient views. When the views are insufficient, the co-training process produces samples that are inconsistent with the optimal classifier. We propose a new weighted co-training algorithm that detects potentially inconsistent samples and reduces their weights, limiting their influence on training. Experiments show better generalization performance and stronger robustness than standard co-training.

Second, motivated by the observation that the quality of crowdsourced labels depends on task difficulty, we propose a new task-assignment algorithm. It estimates the difficulty of a subset of tasks, uses that subset as a training set to learn a difficulty-prediction model, and divides tasks into easy and hard ones: easy tasks are labeled by the crowd, while experts are hired to provide high-quality labels for hard tasks. Experiments show that the algorithm improves label quality while reducing labeling cost.

In addition, the thesis studies model reuse with unlabeled data, a setting in which a user must combine several pre-trained models that cannot be modified. We propose a new multi-view model-reuse algorithm that estimates the reliability of each pre-trained model via belief propagation, guides that estimation with multi-view consistency on the unlabeled data, and then combines the pre-trained models in a weighted ensemble using the estimated reliabilities. Experiments show that the method significantly improves classification accuracy.
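To make the first contribution concrete, the following is a minimal sketch of a weighted co-training loop in the spirit described above, assuming scikit-learn-style estimators and two feature views. The abstract does not say how inconsistent samples are detected, so cross-view disagreement is used here as an assumed stand-in for that detection step; the names (weighted_co_training, down_weight, per_round) are illustrative, not the thesis's implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def weighted_co_training(X1_l, X2_l, y_l, X1_u, X2_u,
                             rounds=5, per_round=10, down_weight=0.1):
        # One classifier per view; each round the most confident unlabeled
        # samples are pseudo-labeled and added to the training set, but samples
        # on which the two views disagree are treated as potentially
        # inconsistent and receive a reduced sample weight.
        X1, X2 = X1_l.copy(), X2_l.copy()
        y = np.asarray(y_l).copy()
        w = np.ones(len(y))
        clf1 = LogisticRegression(max_iter=1000)
        clf2 = LogisticRegression(max_iter=1000)
        for _ in range(rounds):
            if len(X1_u) == 0:
                break
            clf1.fit(X1, y, sample_weight=w)
            clf2.fit(X2, y, sample_weight=w)
            prob1, prob2 = clf1.predict_proba(X1_u), clf2.predict_proba(X2_u)
            pred1 = clf1.classes_[prob1.argmax(1)]
            pred2 = clf2.classes_[prob2.argmax(1)]
            conf1, conf2 = prob1.max(1), prob2.max(1)
            pick = np.argsort(np.maximum(conf1, conf2))[-per_round:]  # most confident this round
            label = np.where(conf1[pick] >= conf2[pick], pred1[pick], pred2[pick])
            weight = np.where(pred1[pick] == pred2[pick], 1.0, down_weight)
            X1 = np.vstack([X1, X1_u[pick]])
            X2 = np.vstack([X2, X2_u[pick]])
            y = np.concatenate([y, label])
            w = np.concatenate([w, weight])
            keep = np.setdiff1d(np.arange(len(X1_u)), pick)  # drop used unlabeled samples
            X1_u, X2_u = X1_u[keep], X2_u[keep]
        return clf1, clf2

The key design point the sketch tries to show is that disagreeing pseudo-labeled samples are still used, but with a small weight, rather than being discarded outright.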
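The second contribution, difficulty-aware task assignment, can be sketched as follows. The abstract does not specify how difficulty is estimated on the probe subset, so crowd-worker disagreement is assumed as a proxy here; the function name, thresholds, and the logistic-regression difficulty model are all hypothetical choices for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def route_tasks(probe_feats, probe_answers, task_feats,
                    disagree_cut=0.3, hard_threshold=0.5):
        # probe_answers[i] is the list of crowd labels collected for probe task i.
        # Difficulty proxy: share of workers disagreeing with the majority answer.
        disagreement = []
        for answers in probe_answers:
            _, counts = np.unique(answers, return_counts=True)
            disagreement.append(1.0 - counts.max() / len(answers))
        is_hard = (np.asarray(disagreement) > disagree_cut).astype(int)
        # Learn a model that predicts whether an unseen task will be hard
        # (assumes the probe subset contains both easy and hard tasks).
        difficulty_model = LogisticRegression(max_iter=1000).fit(probe_feats, is_hard)
        p_hard = difficulty_model.predict_proba(task_feats)[:, 1]
        to_experts = np.where(p_hard > hard_threshold)[0]   # hire experts for these tasks
        to_crowd = np.where(p_hard <= hard_threshold)[0]    # label the rest via crowdsourcing
        return to_crowd, to_experts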
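For the model-reuse contribution, the sketch below illustrates only the reliability-weighting idea: it replaces the thesis's belief-propagation step (guided by multi-view consistency) with a simple fixed-point iteration between a weighted consensus and per-model agreement on unlabeled data. This is an assumed simplification, not the actual algorithm.

    import numpy as np

    def reliability_weighted_ensemble(preds, n_iter=10):
        # preds: (n_models, n_samples) array of predictions made by the fixed
        # pre-trained models (possibly from different views) on shared
        # unlabeled data. Reliabilities and a consensus labeling are refined
        # jointly: each model's weight is its agreement with the current
        # weighted-majority consensus.
        n_models, n_samples = preds.shape
        labels = np.unique(preds)
        w = np.ones(n_models) / n_models          # initial reliabilities
        for _ in range(n_iter):
            votes = np.zeros((len(labels), n_samples))
            for k, c in enumerate(labels):
                votes[k] = ((preds == c) * w[:, None]).sum(axis=0)
            consensus = labels[votes.argmax(axis=0)]
            w = (preds == consensus).mean(axis=1)  # agreement with consensus
            w = w / w.sum()
        return w, consensus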
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year of degree conferral]: 2017
[CLC number]: TP181
Article ID: 1791566
Permalink: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1791566.html