An Online LDA Algorithm Based on a Dynamic Vocabulary
Published: 2018-11-07 11:33
[Abstract]: Most current online latent Dirichlet allocation (LDA) algorithms rely on a fixed vocabulary. In practice, the vocabulary often fails to match the corpus being processed, which limits the usefulness of the model. To address this, the authors work within the framework of the belief propagation (BP) algorithm, let the topic-word distribution follow a Dirichlet process, and rederive the update formulas, so that the vocabulary is empty before the model runs and newly discovered words are added to it continually during processing. Experiments show that the new dynamic-vocabulary algorithm not only makes the vocabulary fit the corpus more closely, but also outperforms fixed-vocabulary LDA models on two metrics: perplexity and the mutual information index.
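The vocabulary handling described in the abstract can be pictured with a minimal sketch. It shows only the bookkeeping of a vocabulary that starts empty and grows while a document stream is processed. It does not implement the paper's Dirichlet-process topic-word distribution or its BP updates. The class name `DynamicVocabulary` and its methods are hypothetical and do not come from the paper.

```python
from typing import Dict, List


class DynamicVocabulary:
    """Hypothetical word-to-index map that starts empty and grows on demand."""

    def __init__(self) -> None:
        self.word_to_id: Dict[str, int] = {}  # empty before any document is seen
        self.id_to_word: List[str] = []

    def add_document(self, tokens: List[str]) -> List[int]:
        """Encode one document, registering every previously unseen word."""
        ids = []
        for token in tokens:
            if token not in self.word_to_id:
                # New word discovered: append it to the vocabulary.
                self.word_to_id[token] = len(self.id_to_word)
                self.id_to_word.append(token)
            ids.append(self.word_to_id[token])
        return ids

    def __len__(self) -> int:
        return len(self.id_to_word)


# Illustrative stream of two tokenized documents (made-up data).
vocab = DynamicVocabulary()
stream = [["topic", "model", "topic"], ["online", "model", "stream"]]
encoded = [vocab.add_document(doc) for doc in stream]
```

In a fixed-vocabulary setting, words such as "online" and "stream" would have to be known in advance or dropped. Here they simply receive the next free index when first encountered.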
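[Author affiliation]: School of Computer Science and Technology, Soochow University
[Funding]: Supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005)
[Classification number]: TP391.1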
Article ID: 2316236
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2316236.html