Research on Instance-Based Domain Adaptation and Incremental Learning Methods
Published: 2018-04-03 20:37
Topic: text classification | Focus: instance transfer | Source: Nanjing University of Science and Technology, master's thesis, 2017
[Abstract]: With the rapid development of Internet technology, the amount of information available online grows by the day. This explosive growth cuts both ways, and making efficient, full use of such information has become a pressing problem for both academia and industry. Text classification is a common technique for this purpose and, by learning setting, can be divided into domain-specific and domain-adaptive text classification. Many instance-transfer-based domain adaptation algorithms already exist, but they share a common weakness: overfitting caused by over-learning of instance weights. To our knowledge, no prior work has explicitly discussed this problem; this thesis studies it systematically. In addition, traditional statistical machine learning models in natural language processing are usually single-task, i.e., the model is learned from the training data in a single pass. This limits the generalization and extensibility of such algorithms, and this thesis addresses the drawback with an incremental approach.

First, the thesis reviews ILA, a representative instance-based domain adaptation algorithm, and proposes regularization methods on top of it to strengthen the transfer effect. The regularization comprises six sub-methods: three based on early stopping; two that use penalty factors as regularization terms of the ILA model; and one that introduces dropout training into instance-weighted learning. Text classification experiments show that all of the regularization methods improve the performance of the instance transfer algorithm to some degree, with dropout training being the most effective.

Second, the thesis systematically studies the overfitting problem of weight learning in domain adaptation. Although the regularization methods above can indirectly alleviate overfitting, they do not address its root cause, and they severely limit the efficiency and adaptability of the algorithm. The thesis therefore proposes loss-function-penalty methods, which penalize the loss to different degrees according to each instance's weight. Experiments show that these methods not only clearly mitigate overfitting but also offer strong adaptability and high efficiency; penalizing the loss on the few highest-weighted instances proves to be the best and most stable variant.

Finally, the thesis proposes an incremental naive Bayes model based on lifelong learning, extending the traditional naive Bayes model with an incremental parameter update scheme and a lifelong learning mechanism. The model stores knowledge learned from large-scale historical tasks, effectively assists new tasks that have only a small amount of labeled data, and updates its parameters incrementally: each round of learning only updates the historical model and never retrains on historical data. Text classification experiments show that the model not only incrementally exploits knowledge learned in past tasks to guide new ones, but also handles previously unseen features well and adapts across domains.
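To make the dropout-based regularizer concrete, here is a minimal sketch of dropout training transplanted to instance-weighted learning, assuming the learned instance weights are stored in a NumPy array. The thesis does not publish its implementation, so the function name, the drop rate, and the 1/(1-p) rescaling below are illustrative assumptions, not the author's code.

```python
import numpy as np

def dropout_instance_weights(inst_weights, p=0.5, rng=None):
    """Zero a random fraction p of instance weights for one training epoch
    and rescale the survivors by 1/(1-p), mirroring standard dropout but
    applied to training instances rather than hidden units (sketch only)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(inst_weights.shape) >= p  # keep each instance with prob 1-p
    return inst_weights * mask / (1.0 - p)
```

In the same spirit, the early-stopping variants would monitor held-out target-domain loss during weight learning and halt updates once it stops improving, so that the weights never reach the over-fitted regime in the first place.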
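The loss-function-penalty idea can be sketched as follows, assuming a weighted logistic loss; the exact penalty form used in the thesis is not reproduced here, so the squared penalty on the top-k instance weights, along with the names `weight_penalized_loss`, `lam`, and `top_k`, are hypothetical choices meant only to show the mechanism of penalizing the few highest-weighted samples.

```python
import numpy as np

def weight_penalized_loss(theta, X, y, inst_weights, lam=0.1, top_k=50):
    """Instance-weighted logistic loss plus a penalty that grows with the
    largest instance weights, discouraging a handful of instances from
    dominating training. X is an (n, d) matrix, y is in {-1, +1}."""
    margins = y * (X @ theta)
    per_instance = np.log1p(np.exp(-margins))   # logistic loss per instance
    data_term = np.mean(inst_weights * per_instance)
    largest = np.sort(inst_weights)[-top_k:]    # the few highest weights
    penalty = lam * np.sum(largest ** 2)        # penalize only those
    return data_term + penalty
```

Because the penalty touches only the top-k weights, the data term is left intact for the bulk of the instances, which is consistent with the abstract's claim that the best variant targets the small number of heavily weighted samples.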
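The incremental naive Bayes model can be approximated by storing sufficient statistics (class and feature counts) instead of raw data, so that each new task's labeled batch updates the model in time proportional to the new data alone. The class below is a generic multinomial naive Bayes sketch under that assumption; the thesis' lifelong knowledge store and its exact update rules are not reproduced, and all names are illustrative.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Multinomial naive Bayes kept as running counts: partial_fit folds in
    a new labeled batch without revisiting historical data, and features
    never seen before are absorbed on the fly (incremental update sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing
        self.class_counts = defaultdict(float)   # documents per class
        self.feat_counts = defaultdict(lambda: defaultdict(float))

    def partial_fit(self, docs, labels):
        # docs: iterable of {feature: count} dicts; labels: class ids
        for doc, c in zip(docs, labels):
            self.class_counts[c] += 1
            for f, n in doc.items():
                self.feat_counts[c][f] += n      # counts only, no raw data kept

    def predict(self, doc):
        total = sum(self.class_counts.values())
        vocab = {f for fc in self.feat_counts.values() for f in fc}
        best, best_lp = None, -math.inf
        for c, cc in self.class_counts.items():
            lp = math.log(cc / total)            # log prior
            denom = sum(self.feat_counts[c].values()) + self.alpha * len(vocab)
            for f, n in doc.items():
                num = self.feat_counts[c].get(f, 0.0) + self.alpha
                lp += n * math.log(num / denom)  # smoothed log likelihood
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

A new task with only a handful of labeled documents would simply call `partial_fit` on that batch; all historical counts are reused as-is, which is what makes each update cheap and keeps retraining on old data unnecessary.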
[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP181