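Research on Machine Learning Algorithms for Different-Source Data
Published: 2018-01-22 11:08
Keywords: machine learning; different-source data; homogeneous different-source data; heterogeneous different-source data; learning from crowds; transfer learning. Source: University of Science and Technology of China, master's thesis, 2017. Document type: degree thesis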
[Abstract]: Traditional machine learning rests on the same-source assumption: training data and test data follow the same distribution. In practice, same-source data is scarce, and a limited amount of it cannot train an effective machine learning model. This is the same-source data scarcity problem. One remedy is to construct same-source data manually, but the cost is too high. Another effective remedy is to train the model on integrated different-source data, that is, data drawn from different distributions, which makes machine learning algorithms for different-source data important. Depending on whether the sample spaces are the same, different-source data divides into homogeneous different-source data and heterogeneous different-source data.

To address the scarcity of same-source data, unlabeled samples can be labeled through crowdsourcing. Each annotator who takes part is treated as one data source, so the collected data is homogeneous different-source data. Machine learning algorithms for this kind of data are called learning-from-crowds algorithms. According to the steps used to obtain the target classifier, they divide into two-stage methods and direct methods. The personal classifier method is the representative direct method; it has a convex objective function but makes a strong assumption about the distribution of the model parameters. This thesis proposes a nonparametric learning-from-crowds algorithm, which constructs a convex objective function by combining optimization objectives and makes no assumption about the distribution of the model parameters.

Another way to integrate different-source data is to use data from other domains to help train the model in the target domain. Data from different domains differ in both sample space and distribution, so they are heterogeneous different-source data. Machine learning algorithms for this kind of heterogeneous data are called transfer learning. According to how knowledge is transferred, transfer learning divides into three classes of methods: those based on sample weights, those based on feature representations, and those based on model parameters. This thesis studies and proposes a model-based transfer method and a method that transfers models and samples jointly. Both methods can use data from the auxiliary domain to improve the model in the target domain.
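To make the two-stage family of learning-from-crowds methods mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration only and is not the thesis's proposed algorithm, which is a direct, nonparametric method. The aggregation rule (majority vote), the classifier (nearest centroid), and all function names and toy data are assumptions chosen for brevity.

```python
from collections import Counter


def majority_vote(annotations):
    """Stage 1: collapse each sample's crowd labels into a single label.

    `annotations` maps a sample id to the labels given by different
    annotators (each annotator is one data source). Majority vote is an
    assumed aggregation rule; ties go to the label encountered first.
    """
    return {sid: Counter(labels).most_common(1)[0][0]
            for sid, labels in annotations.items()}


def fit_centroids(features, labels):
    """Stage 2: fit a nearest-centroid classifier on the aggregated labels."""
    sums, counts = {}, Counter()
    for sid, x in features.items():
        y = labels[sid]
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Hypothetical toy data: four samples, three annotators each.
features = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [1.0, 0.9], "d": [0.9, 1.1]}
annotations = {"a": [0, 0, 1], "b": [0, 1, 0], "c": [1, 1, 0], "d": [1, 1, 1]}

labels = majority_vote(annotations)
centroids = fit_centroids(features, labels)
print(labels)
print(predict(centroids, [0.1, 0.2]), predict(centroids, [0.8, 0.9]))
```

A direct method, by contrast, would learn the target classifier from the per-annotator labels in a single optimization rather than aggregating first.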
Degree-granting institution: University of Science and Technology of China
Degree level: Master's
Year degree granted: 2017
Classification number: TP181
Article link: http://sikaile.net/shoufeilunwen/xixikjs/1454496.html