Research on New Word Discovery and Named Entity Recognition Based on Word Vector Representation
[Abstract]: Mining and analysis of structured data is relatively mature, but unstructured data still poses many challenges. Text is a very important kind of unstructured data, and mining it raises a series of problems, including Chinese word segmentation, named entity recognition, entity relation extraction, semantic understanding, and sentiment analysis. Word segmentation is the basic step of most Chinese text mining and analysis. However, people constantly coin new words, and these cannot all be included in a lexicon; the resulting segmentation errors in turn cause tagging errors for named entities. New word recognition has therefore become a difficult bottleneck problem in text mining. In recent years, word vector representations obtained by training language models with neural networks or deep learning have been shown to represent the semantic relationships between words well. Inspired by this, this thesis applies word vector representations to Chinese new word discovery and recognition, and proposes an unsupervised new word discovery method based on word vector representation and n-grams. First, a neural network language model is trained to map words into a high-dimensional space; a comparison of the word vectors obtained from the Skip-gram and CBOW models shows that the Skip-gram model achieves better results. Second, if several adjacent words often appear together in different word sequences, they must be related in some way. Inspired by association rule algorithms, the thesis designs an efficient n-gram mining algorithm and treats the extracted n-grams as new word candidate strings. The trained word vectors are then used to prune the candidate strings and eliminate noise, yielding the new word results.
The thesis also designs a pruning algorithm and compares the effects of different vector similarity measures on the final results; pruning with cosine similarity performs best. The method is also compared with other new word discovery methods, which demonstrates its effectiveness. Finally, a conditional random field is used to classify the discovered new words and thereby recognize named entities. The main contributions are as follows: (1) Word vectors trained by neural networks are introduced into Chinese new word recognition and combined with n-grams, giving a new unsupervised new word recognition method. (2) Building on new word discovery, a conditional random field is used to classify the new words and identify named entities, which offers a new approach to named entity recognition.
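The two unsupervised steps described in the abstract, mining frequent n-grams as candidates and pruning them with word vector similarity, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the function names `mine_frequent_ngrams`, `cosine`, and `prune_candidates`, the Apriori-style rule that an n-gram is counted only when both of its (n-1)-gram sub-sequences are frequent, the rule that a candidate is kept when every pair of adjacent segments has cosine similarity at or above a threshold, and the toy corpus and two-dimensional vectors. The thesis's actual mining and pruning algorithms may differ.

```python
import math
from collections import Counter


def mine_frequent_ngrams(sentences, max_n=3, min_count=2):
    """Apriori-style mining over pre-segmented sentences (hypothetical helper).

    An n-gram is counted only if both of its (n-1)-gram sub-sequences
    were frequent in the previous round. Returns {ngram_tuple: count}
    for n >= 2.
    """
    frequent = {}
    prev = None  # set of frequent (n-1)-grams from the previous round
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in sentences:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if prev is None or (gram[:-1] in prev and gram[1:] in prev):
                    counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_count}
        if not prev:
            break  # no frequent n-grams, so no longer ones can qualify
        if n >= 2:
            frequent.update({g: counts[g] for g in prev})
    return frequent


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_candidates(candidates, vectors, threshold=0.5):
    """Keep a candidate only if all adjacent segments are similar enough.

    The keep rule and the threshold are assumptions for this sketch.
    Candidates with a segment that has no vector are dropped.
    """
    kept = []
    for gram in candidates:
        if any(tok not in vectors for tok in gram):
            continue
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append("".join(gram))
    return kept


# Toy segmented corpus: the segmenter has split the new word into two pieces.
sentences = [
    ["我", "喜欢", "区块", "链", "技术"],
    ["区块", "链", "很", "火"],
    ["他", "研究", "区块", "链"],
    ["我", "喜欢", "技术"],
]

# Toy 2-dimensional vectors standing in for trained Skip-gram vectors.
vectors = {
    "区块": [1.0, 0.1],
    "链": [0.9, 0.2],
    "我": [0.0, 1.0],
    "喜欢": [1.0, 0.0],
}

candidates = mine_frequent_ngrams(sentences)
new_words = prune_candidates(candidates, vectors)
print(candidates)
print(new_words)
```

In this toy run the frequent bigrams serve as candidates, and the pruning step discards the frequent but unrelated pair while keeping the pair whose segments have similar vectors. Swapping `cosine` for another similarity measure is one way to reproduce the kind of comparison the abstract mentions.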
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 2387406
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2387406.html