Research on the Improvement and Optimization of Chinese Automatic Word Segmentation Technology

Published: 2018-11-10 09:57

[Abstract]: Chinese automatic word segmentation is a fundamental problem in Chinese information processing, and progress on it strongly benefits research in related areas such as information extraction, full-text retrieval, data mining, machine translation, and question answering systems. This thesis studies the main techniques involved in Chinese automatic word segmentation comprehensively and carefully, including dictionary structures and segmentation algorithms; examines the difficult problems in Chinese word segmentation in relative depth; and finally, with reference to currently popular search engine technology, describes how Chinese automatic word segmentation is applied in that field. The main contributions are as follows. First, the thesis surveys dictionary structures for Chinese word segmentation broadly and in depth. Building on three classic dictionary structures (character-by-character binary search, word-by-word binary search, and the Trie index tree) and drawing on many improved dictionary mechanisms, it proposes a segmentation dictionary mechanism based on a multi-hash balanced binary search tree. Second, the thesis focuses on named entity recognition. For Chinese personal names, it designs a new staged recognition method that builds on existing research results and gives a concrete implementation procedure. For Chinese organization names, it incorporates linguistic rules and knowledge into a conditional random field (CRF) statistical model and designs and implements a CRF-and-rule-based recognition system for Chinese medical institution names. Experimental results show precision and recall of 91.68% and 95.21%, respectively, in the closed test, offering a practical new approach to recognizing domain-specific organization names. Finally, given today's pressing demand for retrieval over massive volumes of information, the thesis describes in some detail how Chinese automatic word segmentation is applied in search engines, which both promotes the technology and points to directions for the future optimization and development of search engines.
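
The abstract names the Trie index tree as one of the three classic dictionary structures. As a rough illustration only, a Trie dictionary with greedy forward longest-match lookup might be sketched in Python as below. This is not the multi-hash balanced binary search tree mechanism the thesis proposes, whose details the abstract does not give. The class and function names (`TrieDictionary`, `longest_match`, `segment`) and the sample word list are assumptions made for this example.

```python
from typing import Dict, List


class TrieNode:
    """One node of the dictionary trie; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if the path from the root to here spells a word


class TrieDictionary:
    """Hypothetical segmentation dictionary backed by a character trie."""

    def __init__(self, words: List[str]) -> None:
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Return the length of the longest dictionary word starting at `start`, or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def segment(text: str, dictionary: TrieDictionary) -> List[str]:
    """Greedy forward longest-match segmentation.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        length = dictionary.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


if __name__ == "__main__":
    # Illustrative word list, not taken from the thesis.
    sample = TrieDictionary(["中文", "自動", "分詞", "自動分詞", "技術"])
    print(segment("中文自動分詞技術", sample))  # ['中文', '自動分詞', '技術']
```

Because the trie shares prefixes, finding the longest word at a given position takes a single pass over the characters that follow it. Greedy longest matching does not resolve segmentation ambiguity, which is one of the difficult problems the abstract mentions.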
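[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1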

Article ID: 2322110

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2322110.html
