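A New-Word Discovery Method for Weibo Fusing Rules and Statistics
Published: 2018-03-19 01:30
Topic: Weibo new words. Focus: word-formation rules. Source: Journal of Computer Applications (《计算机应用》), 2017, Issue 04. Paper type: journal article.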
[Abstract]: New words on Weibo (microblogs) follow highly flexible and complex word-formation rules. The traditional C/NC-value method identifies the boundaries of extracted new words with low accuracy and cannot correctly recognize low-frequency new words. To address these problems, a new-word extraction method for Weibo is proposed that fuses manual heuristic rules, an improved C/NC-value algorithm, and a Conditional Random Field (CRF) model. On the one hand, the manual heuristic rules are word-formation rules obtained by classifying and summarizing Weibo new words, designed from the perspectives of Part-Of-Speech (POS), character category, and ideographic symbols. On the other hand, the improved C/NC-value method reconstructs the NC-value objective function by introducing statistics such as word frequency, adjacency entropy, and mutual information, and a CRF model is trained to recognize new words, with the aim of improving both boundary-identification accuracy and the recognition precision of low-frequency new words. Experimental results show that, compared with traditional methods, the proposed method effectively improves the F-value of Weibo new-word recognition.
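The abstract names word frequency, adjacency entropy, and mutual information as the statistics added to the NC-value objective, but it does not give the reconstructed objective function itself. The following is only a minimal sketch of the textbook definitions of those three statistics over a raw character corpus. It is not the authors' implementation. The function names (`occurrences`, `adjacency_entropy`, `candidate_stats`), the character-level probability estimate, and the choice of taking the minimum mutual information over all binary splits are assumptions made for illustration.

```python
import math
from collections import Counter


def occurrences(text, s):
    """Yield every (possibly overlapping) start index of s in text."""
    start = text.find(s)
    while start != -1:
        yield start
        start = text.find(s, start + 1)


def adjacency_entropy(neighbors):
    """Shannon entropy (bits) of the characters adjacent to a candidate."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def candidate_stats(corpus, candidate):
    """Return (frequency, left entropy, right entropy, PMI) for a candidate.

    Assumption: probabilities are estimated as substring count divided by
    the total number of characters, and PMI is the minimum over all binary
    splits of the candidate (0.0 for a single character).
    """
    total_chars = sum(len(text) for text in corpus)

    def count(s):
        return sum(1 for text in corpus for _ in occurrences(text, s))

    left, right = [], []
    for text in corpus:
        for start in occurrences(text, candidate):
            end = start + len(candidate)
            if start > 0:
                left.append(text[start - 1])   # character before the candidate
            if end < len(text):
                right.append(text[end])        # character after the candidate

    freq = count(candidate)
    if freq == 0 or total_chars == 0:
        return 0, 0.0, 0.0, 0.0

    p_whole = freq / total_chars
    pmi = min(
        (
            math.log2(
                p_whole
                / ((count(candidate[:i]) / total_chars)
                   * (count(candidate[i:]) / total_chars))
            )
            for i in range(1, len(candidate))
        ),
        default=0.0,
    )
    return freq, adjacency_entropy(left), adjacency_entropy(right), pmi


if __name__ == "__main__":
    # Toy corpus; real input would be segmented Weibo posts.
    posts = ["abxab", "cabd"]
    print(candidate_stats(posts, "ab"))
```

High left and right adjacency entropy suggests the candidate appears in varied contexts (a likely word boundary), while high mutual information suggests its parts cohere. How the paper weights and combines these inside the NC-value objective is not stated in the abstract.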
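[Author affiliation]: School of Computer and Information Technology, Beijing Jiaotong University
[Funding]: National Natural Science Foundation of China (61370130, 61473294); Fundamental Research Funds for the Central Universities (2014RC040); International Science and Technology Cooperation Program of the Ministry of Science and Technology (K11F100010)
[CLC number]: TP391.1; TP393.092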
Article ID: 1632243
Article link: http://sikaile.net/guanlilunwen/ydhl/1632243.html