Research and Implementation of a Domain-Adaptive Chinese Word Segmentation System
Keywords: Chinese word segmentation; multi-model; character tagging; domain adaptation; feature embedding. Source: Shenyang Aerospace University, 2017 master's thesis. Document type: degree thesis.
【Abstract】: Chinese word segmentation is the process of splitting a continuous character sequence into a well-formed word sequence according to a given standard. As one of the most basic steps in natural language processing, it is a key stage that applications such as information retrieval, knowledge acquisition, and machine translation must handle, so its study has both theoretical and practical significance. This thesis proposes a character-based multi-model segmentation method that uses a neural-network architecture to build a separate model for each character. Because Chinese characters carry semantic information of their own, the same character can have different meanings and functions in different contexts, so each character follows its own word-formation patterns. Unlike existing character-tagging segmentation methods, the proposed method can distinguish the influence of each feature on each character being segmented, and thus learn character-specific word-formation regularities. Compared with a single-model method, a CRF method, and prior work, the character-based multi-model method achieves better segmentation results, reaching F-scores of 93.4% on the PKU corpus and 95.5% on the MSR corpus from the SIGHAN Bakeoff 2005 simplified-Chinese datasets. Building on this method, the thesis then proposes a character-based domain-adaptive segmentation method for the domain-adaptation task. Because the character models are mutually independent, when the model is updated the character models that transfer well are kept while those that transfer poorly are retrained. This avoids both the difficulty of sharing large-scale segmented data and the need to retrain from scratch on a mixture of source- and target-domain data; when segmenting the target domain, domain adaptation is achieved through the models' adaptive capability. Since feature embeddings effectively alleviate feature sparsity, this thesis uses feature embeddings to represent the input features. Experimental results show that the proposed segmentation method clearly improves domain adaptability. Finally, a domain-adaptive Chinese word segmentation system is designed and implemented. The system segments input sentences or texts using an existing base model, supports adding domain-specific dictionaries, and can update the base model with training data from the target domain, thereby obtaining better segmentation results in that domain.
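A minimal sketch of the character-based multi-model tagging idea described in the abstract, assuming BMES character tags and context-window features, and substituting a simple perceptron-style update for the thesis's per-character neural models (all class and function names are illustrative):

```python
# Hypothetical sketch: one tiny model per distinct focus character, each
# trained independently, so the same context feature can affect different
# characters differently. Perceptron updates stand in for neural training.
from collections import defaultdict

TAGS = ("B", "M", "E", "S")  # word-begin / middle / end / single-character word

def window_features(chars, i, size=2):
    """Positional context-window features around position i, padded with '#'."""
    padded = ["#"] * size + list(chars) + ["#"] * size
    return [f"{k}:{padded[i + size + k]}" for k in range(-size, size + 1)]

def _to_tags(words):
    """Flatten a segmented sentence into characters and gold BMES tags."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        tags.extend("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
    return chars, tags

class MultiModelSegmenter:
    def __init__(self):
        # an independent feature->tag weight table per focus character
        self.models = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    def predict_tag(self, char, feats):
        w = self.models[char]
        return max(TAGS, key=lambda t: sum(w[f][t] for f in feats))

    def train(self, sentences, epochs=5):
        """sentences: lists of words, e.g. [["中文", "分詞"], ...]"""
        for _ in range(epochs):
            for words in sentences:
                chars, gold = _to_tags(words)
                for i, c in enumerate(chars):
                    feats = window_features(chars, i)
                    pred = self.predict_tag(c, feats)
                    if pred != gold[i]:  # update this character's model on error
                        for f in feats:
                            self.models[c][f][gold[i]] += 1.0
                            self.models[c][f][pred] -= 1.0

    def segment(self, text):
        chars = list(text)
        tags = [self.predict_tag(c, window_features(chars, i))
                for i, c in enumerate(chars)]
        words, cur = [], ""
        for c, t in zip(chars, tags):
            cur += c
            if t in ("E", "S"):  # a word ends at E or S
                words.append(cur)
                cur = ""
        if cur:
            words.append(cur)
        return words
```

Because each character owns its weight table, the tables can later be kept or retrained individually, which is what makes the domain-adaptation scheme in the abstract possible.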
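The feature-embedding representation mentioned in the abstract can be sketched as a lazily filled lookup table mapping each context character to a small dense vector, so that rare feature combinations still get a usable representation instead of a new sparse indicator. The dimension and random initialization below are illustrative assumptions, not values from the thesis:

```python
# Hypothetical sketch: dense feature embeddings for segmentation input.
# A window's representation is the concatenation of its characters' vectors.
import random

class FeatureEmbedding:
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.table = {}                 # char -> dense vector

    def vector(self, char):
        # unseen characters get a small random vector lazily, which is what
        # lets embeddings sidestep the sparsity of one-hot indicator features
        if char not in self.table:
            self.table[char] = [self.rng.uniform(-0.1, 0.1)
                                for _ in range(self.dim)]
        return self.table[char]

    def embed_window(self, chars, i, size=2):
        """Concatenated embeddings of the size-2..+2 window around position i."""
        padded = ["#"] * size + list(chars) + ["#"] * size
        out = []
        for k in range(2 * size + 1):
            out.extend(self.vector(padded[i + k]))
        return out  # length = (2*size + 1) * dim
```

In the thesis's setting these vectors would be trained jointly with the per-character models rather than left at their random initialization; the sketch only shows the representation.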
【Degree-granting institution】: Shenyang Aerospace University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP391.1