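A Weibo New-Word Discovery Method Combining Rules and Statistics
Published: 2018-03-19 01:30
Topic: Weibo new words. Focus: word-formation rules. Source: 《計算機應用》 (Journal of Computer Applications), 2017, Issue 04. Type: journal article.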
[Abstract]: Word formation in Weibo new words is highly unconstrained and complex. The traditional C/NC-value method identifies the boundaries of extracted new words with low accuracy and fails to recognize low-frequency Weibo new words. To address these problems, this paper proposes a Weibo new-word extraction method that combines hand-crafted heuristic rules, an improved C/NC-value algorithm, and a conditional random field (CRF) model. On the one hand, the heuristic rules are word-formation rules derived by classifying and summarizing Weibo new words, designed from the perspectives of part of speech (POS), character category, and ideographic symbols. On the other hand, the improved C/NC-value method reconstructs the NC-value objective function by introducing statistics such as word frequency, adjacency entropy, and mutual information, and a CRF model is trained and used to recognize new words, with the aim of improving both boundary identification accuracy and recognition precision for low-frequency new words. Experimental results show that, compared with traditional methods, the proposed method effectively improves the F-score of Weibo new-word recognition.
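Two of the statistics named in the abstract, mutual information and adjacency entropy, can be illustrated with a minimal Python sketch. This is not the paper's improved NC-value objective function, which the abstract does not specify; the function names, the character-level n-gram counting, and the use of the minimum pointwise mutual information over binary splits are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def min_pmi(candidate, counts, total_chars):
    """Lowest pointwise mutual information (in bits) over all two-way
    splits of candidate. A high value suggests the parts cohere.
    Probabilities are approximated as count / total_chars (an assumption)."""
    if counts[candidate] == 0:
        return -math.inf  # candidate never observed
    p_whole = counts[candidate] / total_chars
    best = math.inf  # stays inf for single-character candidates
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        best = min(best, math.log2(p_whole / (p_left * p_right)))
    return best


def adjacency_entropy(candidate, text):
    """Return (left, right) entropy in bits of the characters adjacent to
    each occurrence of candidate. High entropy on both sides suggests the
    candidate's boundaries are real word boundaries."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)

    def entropy(neighbors):
        total = sum(neighbors.values())
        if total == 0:
            return 0.0
        return -sum(v / total * math.log2(v / total)
                    for v in neighbors.values())

    return entropy(left), entropy(right)


# Toy corpus (hypothetical): "XY" always appears as a unit in varied contexts.
corpus = "aXYbcXYdeXYf"
counts = ngram_counts(corpus, max_n=2)
score = min_pmi("XY", counts, len(corpus))
left_h, right_h = adjacency_entropy("XY", corpus)
```

In a pipeline of the kind the abstract describes, scores like these would be one input among several: candidates would first be filtered by the word-formation rules, and the statistics would feed a ranking function or serve as features for the CRF model.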
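[Affiliation]: School of Computer and Information Technology, Beijing Jiaotong University
[Funding]: National Natural Science Foundation of China (61370130, 61473294); Fundamental Research Funds for the Central Universities (2014RC040); International Science and Technology Cooperation Program of the Ministry of Science and Technology (K11F100010)
[Classification]: TP391.1; TP393.092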
Article ID: 1632243
Article link: http://sikaile.net/guanlilunwen/ydhl/1632243.html