Research on Lexical Analysis Based on Neural Networks
Published: 2018-03-08 01:22
Topic: Chinese word segmentation | Focus: part-of-speech tagging | Source: Nanjing University, master's thesis, 2017 | Document type: degree thesis
【Abstract】: Lexical analysis is a fundamental task in natural language processing, consisting of two basic subtasks: Chinese word segmentation and part-of-speech (POS) tagging. Word segmentation converts a Chinese character sequence into a sequence of words, and almost every Chinese text-analysis task depends on it. POS tagging assigns a part-of-speech category to each word in a sentence; for higher-level tasks such as syntactic and semantic analysis, POS information helps resolve ambiguity and alleviates the sparsity of word features. Although lexical analysis is a basic task, it has very broad demand and application prospects, and it remains an active research topic in natural language processing. Early Chinese word segmentation, constrained by limited computing resources and the lack of annotated corpora, generally relied on dictionary-based rule methods. As computing power grew and annotated corpora became available, segmentation techniques gradually shifted from rule-based methods to machine learning, among which character-based tagging is currently the most widely used approach. Since the rise of deep learning, some researchers have also applied neural networks to segmentation and made progress; POS tagging has followed a similar research path. In this thesis, we first address the limitation that traditional character-tagging segmentation models extract only local, window-based features and cannot capture long-distance dependencies: we replace the original feature-extraction module with a bidirectional long short-term memory (BiLSTM) network, which both preserves long-distance information and simplifies feature extraction. Second, we design a greedy model and a structured model, both based on the BiLSTM. Finally, to address the mismatch between general-purpose embeddings and specific tasks, we design task-specific embedding models for segmentation and for POS tagging. Experimental results show that the BiLSTM-based segmentation model performs comparably to traditional models, and the simple, fast greedy model matches the structured model in accuracy. After adding character embeddings pretrained with the WCC (Word-context Character Embedding) model, the segmenter achieves state-of-the-art or comparable performance on standard datasets and also performs well in domain-adaptation experiments. For the POS tagging model, adding word embeddings pretrained with the PCS (POS Sensitive Embedding) model improves the tagger, and the PCS model can quickly exploit heterogeneous data to raise model performance.
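To make the model description concrete, below is a minimal PyTorch sketch of the greedy BiLSTM character-tagging segmenter outlined in the abstract. The BMES tag set, hyperparameter values, toy vocabulary, and the per-position argmax decoder are illustrative assumptions for this sketch, not the configuration reported in the thesis (which also covers a structured decoder and WCC-pretrained character embeddings).

```python
# A minimal BiLSTM character tagger for Chinese word segmentation (BMES scheme).
# All hyperparameters and the toy vocabulary below are illustrative assumptions,
# not the settings used in the thesis.
import torch
import torch.nn as nn

class BiLSTMSegmenter(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # character embeddings
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)  # context in both directions
        self.out = nn.Linear(2 * hidden_dim, num_tags)        # per-character B/M/E/S scores

    def forward(self, char_ids):                              # char_ids: (batch, seq_len)
        hidden, _ = self.bilstm(self.embed(char_ids))
        return self.out(hidden)                               # (batch, seq_len, num_tags)

# Greedy decoding: pick the highest-scoring tag at each position independently.
sentence = "自然语言处理"
char2id = {c: i + 1 for i, c in enumerate(dict.fromkeys(sentence))}  # toy vocabulary, 0 = padding
model = BiLSTMSegmenter(vocab_size=len(char2id) + 1)
ids = torch.tensor([[char2id[c] for c in sentence]])
tag_ids = model(ids).argmax(dim=-1)                           # untrained, so output is arbitrary
print(tag_ids)
```

The structured variant described in the abstract would keep the same BiLSTM encoder but add tag-transition scores and decode the whole sequence jointly (e.g. with Viterbi) rather than choosing each character's tag independently.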
【Degree-granting institution】: Nanjing University
【Degree level】: Master's
【Year conferred】: 2017
【CLC classification】: TP391.1; TP183