CPACA: A Probabilistic-Model Chinese Word Segmentation Algorithm Based on the Aho-Corasick Automaton
Published: 2018-08-28 15:49
[Abstract]: The Aho-Corasick automaton is a well-known multi-pattern string matching algorithm. When a pattern fails to match, it follows a fail pointer to a valid subsequent state, and there may be one or more such valid states. Building on this property, this paper proposes an automaton algorithm adapted to Chinese word segmentation. The algorithm uses dynamic programming to compute contextual matching probabilities and transitions to the best valid subsequent state, thereby combining mechanical segmentation based on string matching with methods based on statistical probability models. Experimental results show that the algorithm achieves high segmentation accuracy.
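The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the paper's CPACA algorithm. It assumes a simplified two-stage variant: an Aho-Corasick automaton first finds every dictionary word in the sentence, and dynamic programming then picks the segmentation with the highest unigram probability. The names `AhoCorasick`, `segment`, and `freq`, the toy word counts, and the fixed penalty for unknown characters are all hypothetical choices made for illustration.

```python
import math
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton over a word dictionary."""

    def __init__(self, words):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # fail[state] -> state of the longest proper suffix
        self.out = [[]]    # out[state] -> lengths of words ending at this state
        for word in words:
            self._insert(word)
        self._build_fail()

    def _insert(self, word):
        s = 0
        for ch in word:
            if ch not in self.goto[s]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[s][ch] = len(self.goto) - 1
            s = self.goto[s][ch]
        self.out[s].append(len(word))

    def _build_fail(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0)
                # Inherit the words recognised by the suffix state.
                self.out[t] = self.out[t] + self.out[self.fail[t]]
                queue.append(t)

    def matches(self, text):
        """Yield (start, end) for every dictionary word found in text."""
        s = 0
        for i, ch in enumerate(text):
            while s and ch not in self.goto[s]:
                s = self.fail[s]          # mismatch: follow fail pointers
            s = self.goto[s].get(ch, 0)
            for length in self.out[s]:
                yield i + 1 - length, i + 1


def segment(text, freq):
    """Return the maximum-probability segmentation under a unigram model."""
    total = sum(freq.values())
    ac = AhoCorasick(freq)
    n = len(text)
    # Assumption: an unknown single character costs a fixed low probability.
    unk = math.log(1.0 / (total * 10))
    edges = [[] for _ in range(n + 1)]    # edges[end] -> [(start, log_prob)]
    for start, end in ac.matches(text):
        edges[end].append((start, math.log(freq[text[start:end]] / total)))
    best = [-math.inf] * (n + 1)          # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                  # back[i]: start of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start, logp in edges[end] + [(end - 1, unk)]:
            score = best[start] + logp
            if score > best[end]:
                best[end], back[end] = score, start
    words, end = [], n
    while end > 0:                        # backtrack from the end of the text
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Toy counts, invented for this example only.
freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
result = segment("研究生命起源", freq)
```

With these toy counts, `result` is `['研究', '生命', '起源']`, whereas a greedy forward longest-match segmenter would take 研究生 first and be left with 命 / 起源. Resolving that kind of ambiguity is what the probabilistic step is for.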
[Author affiliation]: Faculty of Engineering and Applied Science, Queen's University
[Classification number]: TP391.1
Article ID: 2209871
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2209871.html