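Research on Traditional Chinese Spelling Error Detection
Topic: Chinese language processing  Focus: spelling error detection  Source: Nanjing University of Posts and Telecommunications, master's thesis, 2016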
[Abstract]: Traditional Chinese spelling error detection is the use of computers to automatically detect whether Chinese characters are misused in Traditional Chinese text. It is an important research topic in Chinese information processing and an important component of many natural language processing systems, such as search engines and word processing software. Compared with widely used Western languages such as English, Chinese has more complex linguistic characteristics: there is no explicit separator between words, and both word collocations and grammatical collocations are complex and varied. Spelling error detection is therefore harder for Traditional Chinese than for English. Research on Simplified Chinese spelling error detection began earlier than research on Traditional Chinese, and its main approaches are rule-based, statistical, and feature- and learning-based methods. However, these methods are built on Simplified Chinese corpora and do not cover many types of spelling errors, so they can serve only as reference methods. In recent years, with the launch of evaluation campaigns for Traditional Chinese spelling error detection, the task has gradually become a hot topic in Chinese information processing. This thesis takes the detection of spelling errors in Traditional Chinese text as its research goal and proposes three detection methods. (1) The first is a detection method based on a statistical dictionary of segmented character strings. It uses the frequencies of character strings in a corpus as the evidence for detection, builds a statistical dictionary from the strings and their frequencies, and applies a detection algorithm that judges candidates with statistical rules. (2) The second is a detection method based on a graph model and a part-of-speech bi-gram model. Starting from Chinese word segmentation, it represents the segmentation result and the results of replacing suspicious words as a graph, and uses the part-of-speech bi-gram model to determine the final erroneous characters. (3) The third targets errors in the common particles "的", "地", and "得". It builds a statistical model of contextual parts of speech from a training corpus and uses that model to judge whether a particle is used correctly. All three methods are evaluated on a Traditional Chinese spelling evaluation data set and compared with existing detection methods. The experimental results indicate that the proposed methods achieve good results and contribute to the development of Traditional Chinese spelling error detection.
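To make the third method more concrete, here is a minimal Python sketch of a contextual part-of-speech statistical model for the particles "的", "地", and "得". The abstract does not give the thesis's actual model, so everything here is an assumption made for illustration: the context is taken to be the pair (POS of the previous word, POS of the next word), the decision rule is "flag the particle if another particle is more frequent in that context", and the names `train`, `check`, the tag set, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

# The three particles named in the abstract.
PARTICLES = {"的", "地", "得"}


def train(tagged_sentences):
    """Count how often each particle occurs in each (previous POS, next POS) context.

    Assumption: input is already segmented and POS-tagged, as a list of
    sentences, each a list of (word, pos) pairs.
    """
    model = defaultdict(Counter)
    for sent in tagged_sentences:
        # Particles at the very start or end of a sentence have no
        # two-sided context, so this sketch skips them.
        for i in range(1, len(sent) - 1):
            word, _ = sent[i]
            if word in PARTICLES:
                context = (sent[i - 1][1], sent[i + 1][1])
                model[context][word] += 1
    return model


def check(model, sent):
    """Return (index, used_particle, suggested_particle) for suspicious particles."""
    flagged = []
    for i in range(1, len(sent) - 1):
        word, _ = sent[i]
        if word not in PARTICLES:
            continue
        counts = model.get((sent[i - 1][1], sent[i + 1][1]))
        if not counts:
            continue  # context never seen in training: no evidence either way
        best, _ = counts.most_common(1)[0]
        if best != word:
            flagged.append((i, word, best))
    return flagged


# Toy training corpus with hypothetical POS tags.
TRAIN = [
    [("漂亮", "ADJ"), ("的", "PART"), ("花", "NOUN")],
    [("慢慢", "ADV"), ("地", "PART"), ("走", "VERB")],
    [("跑", "VERB"), ("得", "PART"), ("快", "ADJ")],
]
model = train(TRAIN)

# "慢慢的走" uses "的" where the training data saw "地".
print(check(model, [("慢慢", "ADV"), ("的", "PART"), ("走", "VERB")]))
# [(1, '的', '地')]
```

A real system would need a segmenter and POS tagger for Traditional Chinese, a much larger training corpus, and probably a frequency threshold or smoothing rather than a bare majority vote.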
[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1
Related master's theses (top 1)
1 Wang Yong (王勇); Research on Traditional Chinese Spelling Error Detection [D]; Nanjing University of Posts and Telecommunications; 2016
Article ID: 1672209
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1672209.html