Research on the Recognition of Low-Frequency Two-Character Unknown Words
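Keywords: low frequency; two-character words; unknown (out-of-vocabulary) words; morpheme features; web retrieval. Source: Nanjing Normal University, master's thesis, 2012. Document type: degree thesis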
[Abstract]: Unknown (out-of-vocabulary) words are the main factor limiting the accuracy of automatic Chinese word segmentation. Low-frequency words are the hardest unknown words to recognize, and two-character words make up a large share of the low-frequency unknown words. This thesis therefore concentrates on recognizing low-frequency two-character unknown words efficiently, using a combination of statistical and rule-based methods, and obtains some positive results.
To make recognition more efficient and the experimental results easier to evaluate statistically, we first preprocess the text in three steps: first, segment the text and extract the segmentation fragments; second, recognize named entities, an important class of unknown words; third, recognize some of the multi-character unknown words. We then look for low-frequency two-character unknown words among the remaining fragments, combining several statistical and rule-based measures: mutual information, word-formation versus non-word probability, adjacent-character entropy, and morpheme-feature combinations. The experimental results are only moderate, but the approach is still of practical use for assisting recognition and extracting new words, and it can considerably reduce the burden of manual identification.
During the experiments we found that the vagueness of the definition of a word and inconsistent segmentation in the corpus are major reasons why two-character unknown words are hard to recognize correctly. We studied this problem in depth and proposed a new, more reasonable definition of the two-character word. We then annotated a small test corpus ourselves; with the same recognition methods, both precision and recall improved substantially.
Finally, we proposed and implemented a web-based discrimination method that quantifies the property of being "tightly bound and stable in use". The method performed well in experiments on judging low-frequency two-character unknown words, with the F-measure reaching a maximum of 86%. This suggests that web resources may be a breakthrough point for improving automatic word segmentation, and the automatic recognition of unknown words in particular.
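Two of the statistics named in the abstract, mutual information and adjacent-character entropy, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions (pointwise mutual information over adjacent character pairs, and Shannon entropy over the characters next to a candidate); the abstract does not give the thesis's exact formulas or thresholds, so the function names `pmi` and `neighbor_entropy` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi(corpus: str, a: str, b: str) -> float:
    """Pointwise mutual information (base 2) of the adjacent pair a+b.

    Hypothetical helper: probabilities are plain relative frequencies
    taken from the raw character string, with no smoothing.
    """
    n_pairs = len(corpus) - 1
    if n_pairs <= 0:
        return float("-inf")
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(n_pairs))
    p_ab = pairs[a + b] / n_pairs
    if p_ab == 0:
        return float("-inf")  # the pair never occurs
    p_a = chars[a] / len(corpus)
    p_b = chars[b] / len(corpus)
    return math.log2(p_ab / (p_a * p_b))


def neighbor_entropy(corpus: str, word: str, side: str = "right") -> float:
    """Shannon entropy (bits) of the characters adjacent to `word`.

    side="left" looks at the character before each occurrence,
    side="right" at the character after it. Occurrences at the
    corpus boundary contribute no neighbor.
    """
    neighbors = Counter()
    start = corpus.find(word)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(word)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(word, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


if __name__ == "__main__":
    # Toy corpus invented for this sketch, not data from the thesis.
    toy = "他喜欢网购，她也网购，大家都网购。"
    print(pmi(toy, "网", "购"))
    print(neighbor_entropy(toy, "网购", "left"))
    print(neighbor_entropy(toy, "网购", "right"))
```

A high mutual information score suggests the two characters are tightly bound, and high entropy on both sides suggests the pair is used freely in varied contexts. How the thesis combines these scores with its rules is not stated in the abstract.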
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: H08