Research on Instance-Based Domain Adaptation and Incremental Learning Methods

Published: 2018-04-03 20:37

Topic: text classification　Focus: instance transfer　Source: Nanjing University of Science and Technology, 2017 master's thesis

[Abstract]: With the rapid development of Internet technology, the amount of information available online grows daily. This explosive growth has both benefits and drawbacks, and how to use the information efficiently and fully has become an urgent problem for academia and industry. Text classification is a commonly used technique for this kind of problem; by learning setting, it can be divided into domain-specific and domain-adaptive text classification. Many domain adaptation algorithms based on instance transfer already exist, but they share a common phenomenon: overfitting caused by over-learning the instance weights. To the author's knowledge, no prior work has explicitly discussed this problem, and this thesis studies it systematically. In addition, in natural language processing, traditional statistical machine learning models are usually single-task, that is, the model is learned from the training data in one pass. This limits the generalization and scalability of the algorithms, and this thesis addresses that shortcoming with incremental improvements.

First, the thesis introduces ILA, a representative instance-based domain adaptation algorithm, and on that basis proposes regularization methods to strengthen the effect of transfer learning. The regularization methods comprise six sub-methods: three based on early stopping; two that add a penalty factor to the ILA model as a regularization term; and one that introduces Dropout Training into instance-weighted learning. Text classification experiments show that all of the regularization methods improve the performance of the instance transfer algorithm to some degree, with Dropout Training giving the most significant gain.

Second, the thesis systematically studies the overfitting problem of weight learning in domain adaptation. Although the regularization methods above alleviate overfitting indirectly, they do not solve the underlying problem, and they severely limit the efficiency and adaptability of the algorithm. The thesis therefore proposes methods based on loss-function penalties, which penalize the loss function to different degrees according to the weights of the instances. Experimental results show that these methods not only clearly mitigate overfitting but are also adaptable and efficient; among them, the loss-function penalty based on the few samples with large weights performs best and is the most stable.

Finally, the thesis proposes an incremental naive Bayes model based on lifelong learning. Building on the traditional naive Bayes model, it introduces an incremental way of updating the model parameters and a lifelong learning mechanism. The model can store the knowledge learned from large-scale historical tasks and use it to assist the learning of new tasks that have only a few labeled samples. It updates its parameters incrementally, so each round of learning only updates the historical model and does not retrain on the historical data. Experimental results on text classification show that the model can incrementally use knowledge learned in past tasks to guide the learning of new tasks, and that it also handles new features and adapts across domains well.
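The incremental update described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's model: it is a generic multinomial naive Bayes with Laplace smoothing, chosen only to show how a classifier can fold in a new batch by updating stored counts instead of retraining on old data. The class name `IncrementalNaiveBayes`, the method `partial_fit`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Multinomial naive Bayes that stores only count statistics (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.class_docs = defaultdict(int)    # documents seen per class
        self.class_tokens = defaultdict(int)  # total tokens seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.vocab = set()                    # grows when a batch brings new words

    def partial_fit(self, docs, labels):
        """Fold one batch into the stored counts; earlier batches are not revisited."""
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            for tok in tokens:
                self.word_counts[label][tok] += 1
                self.class_tokens[label] += 1
                self.vocab.add(tok)

    def predict(self, tokens):
        """Return the class with the highest smoothed log posterior."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            denom = self.class_tokens[label] + self.alpha * vocab_size
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # a word never seen in any batch carries no evidence
                count = self.word_counts[label].get(tok, 0)
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Two batches arrive one after the other; the second adds words the first lacked.
model = IncrementalNaiveBayes()
model.partial_fit([["good", "great"], ["bad", "awful"]], ["pos", "neg"])
model.partial_fit([["excellent", "good"], ["terrible"]], ["pos", "neg"])
print(model.predict(["excellent"]))
```

Because the parameters are derived from additive counts, training on two batches in sequence leaves the same statistics as training once on their union, which is what makes the historical data unnecessary after each update. The lifelong learning mechanism and domain adaptation behavior mentioned in the abstract are not modeled here.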
[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP181


Article ID: 1706812


Link to this article: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1706812.html
