Automatic Word Segmentation of Middle Chinese Based on CRFs and Dictionary Information
Posted: 2018-02-26 06:02
Keywords: CRFs model; segmentation consistency; Middle Chinese; automatic word segmentation. Source: Data Analysis and Knowledge Discovery (《數據分析與知識發現》), 2017, No. 05. Type: journal article.
[Abstract]: [Objective] To verify how segmentation consistency and corpus genre affect the performance of CRFs-based word segmentation for Middle Chinese, and on that basis to further improve segmentation performance and reduce the manual proofreading workload. [Methods] Taking Middle Chinese histories, Buddhist scriptures, and fiction as example corpora, the paper addresses automatic word segmentation for Middle Chinese by refining the segmentation guidelines and combining a CRFs model with a dictionary, in order to eliminate the inconsistencies that tend to arise in manually segmented Middle Chinese text. It also introduces two features into CRFs segmentation, character class and dictionary information, and selects the most suitable feature template for each through comparative experiments. [Results] The overall F-score of the segmentation results exceeds 99% in the closed test and reaches 89%-95% in the combined open test. [Limitations] The study of segmentation inconsistency focuses mainly on two-character words, so recognition of words of three or more characters (multi-character words) is somewhat weaker. [Conclusions] Provided that segmentation consistency is effectively improved, the character-class and dictionary-tag features effectively improve the accuracy of CRFs segmentation for Middle Chinese. The Middle Chinese segmentation system proposed in the paper can also serve Middle Chinese corpora of multiple genres.
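The abstract does not give the feature templates themselves, so the following is only a minimal sketch of how character-class and dictionary features might be attached to each character before CRF training. Everything in it is an assumption made for illustration: the B/M/E/S label scheme, the class inventory (`HAN`, `NUM`, `PUNC`), the forward-maximum-match dictionary lookup, the function names, and the example sentence with its segmentation. None of it is taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical character classes; the paper's actual inventory is not stated.
NUMERALS = set("一二三四五六七八九十百千萬")
PUNCT = set("，。、；：？！")


def to_bmes(words: List[str]) -> List[str]:
    """Turn a segmented sentence into per-character B/M/E/S labels."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_class(ch: str) -> str:
    """Map a character to a coarse class (assumed inventory)."""
    if ch in PUNCT:
        return "PUNC"
    if ch in NUMERALS:
        return "NUM"
    return "HAN"


def dict_tags(chars: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Mark dictionary hits found by forward maximum matching.

    Characters inside a matched word get B/M/E; all others get O.
    """
    tags = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(chars) - i), 1, -1):
            if chars[i:i + n] in lexicon:
                tags[i:i + n] = ["B"] + ["M"] * (n - 2) + ["E"]
                i += n
                break
        else:
            i += 1  # no dictionary word starts here
    return tags


def featurize(chars: str, lexicon: Set[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character with a +/-1 context window."""
    dtags = dict_tags(chars, lexicon)
    feats: List[Dict[str, str]] = []
    for i, ch in enumerate(chars):
        f = {"c0": ch, "cls0": char_class(ch), "dict0": dtags[i]}
        if i > 0:
            f["c-1"] = chars[i - 1]
            f["cls-1"] = char_class(chars[i - 1])
        if i < len(chars) - 1:
            f["c+1"] = chars[i + 1]
            f["cls+1"] = char_class(chars[i + 1])
        feats.append(f)
    return feats


# Illustrative sentence and segmentation (assumed, not from the paper).
words = ["王子猷", "居", "山陰"]
sentence = "".join(words)
lexicon = {"王子猷", "山陰"}

labels = to_bmes(words)
features = featurize(sentence, lexicon)
for ch, f, y in zip(sentence, features, labels):
    print(ch, f["cls0"], f["dict0"], y)
```

The per-character feature dictionaries and label sequences produced this way could be handed to a CRF toolkit that accepts that input shape; converting them to a column format for a template-driven toolkit would be a separate step.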
[Affiliation]: School of Chinese Language and Literature, Nanjing Normal University
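[Funding]: This work is one of the research outcomes of the National Social Science Fund of China Major Project "Research on Building a Corpus for the Study of the History of Chinese" (No. 10&ZD117) and the National Social Science Fund of China Major Project "Construction of a Classical-Text Knowledge Base Based on the Sinological Index Series and Research on Humanities Computing" (No. 15ZDB127); it was also supported by the Ministry of Education Humanities and Social Sciences Youth Project "Construction and Quantitative Study of a Diachronic Chinese Lexical Database" (No. 16YJC740034).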
[CLC number]: TP391.1
Article ID: 1536773
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1536773.html