Research on New Word Discovery and Named Entity Recognition Based on Word Vector Representations

Published: 2018-12-19 20:27
[Abstract]: Mining and analysis of structured data is relatively mature, but unstructured data still poses many challenges. Text is a very important kind of unstructured data, and mining it raises a series of further problems, including Chinese word segmentation, named entity recognition, entity relation extraction, semantic understanding, and sentiment analysis. Word segmentation is the basic first step of nearly all Chinese text mining. However, people constantly coin new words, and no lexicon can include all of them, so new words cause segmentation errors, which in turn cause named entities to be tagged incorrectly. New word recognition has therefore become a difficult bottleneck problem in text mining. In recent years, word vector representations obtained by training language models with neural networks or deep learning have proved able to capture the semantic relationships between words well. Inspired by this, this thesis applies word vector representations to Chinese new word discovery and proposes an unsupervised new word discovery method that combines word vectors with n-grams. First, a neural network language model is trained to map words into a high-dimensional space; comparing the effect of word vectors from the Skip-gram and CBOW models on the new word results shows that the Skip-gram model gives better results. Second, the thesis assumes that if several adjacent words frequently co-occur across different word sequences, they are related in some way. Inspired by association rule algorithms, it designs an efficient n-gram mining algorithm and treats the mined n-grams as candidate new word strings. Next, the trained word vectors are used to prune the candidate strings and remove noise, which yields the new word results. The thesis also designs the pruning algorithm and compares the effect of different vector similarity measures on the final results, finding that pruning with cosine similarity works best. The method is also compared with other new word discovery methods, which confirms its effectiveness. Finally, building on the new word results, a conditional random field is used to classify them and thereby recognize named entity words. The main contributions are: (1) word vectors trained by neural networks are introduced into Chinese new word recognition and combined with n-grams, giving a new unsupervised new word recognition method; (2) on top of new word discovery, a conditional random field is used to classify the new words and identify the named entities among them, offering a new practical approach to named entity recognition.
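The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the two unsupervised steps (association-rule-style n-gram mining, then cosine-similarity pruning), not the thesis's actual algorithm. The function names, the thresholds, the toy corpus, and the hand-made 2-d vectors are all assumptions for illustration; in particular, the pruning rule used here (keep a candidate only if every pair of adjacent words has cosine similarity above a threshold) is a guess, since the abstract does not state the rule.

```python
from collections import Counter
from math import sqrt


def frequent_ngrams(sentences, min_count=2, max_n=3):
    """Apriori-style mining of contiguous word n-grams (n >= 2).

    An n-gram is counted only if both of its (n-1)-gram sub-sequences
    were frequent at the previous level (downward closure).
    """
    # Level 1: frequent single words.
    counts = Counter((w,) for s in sentences for w in s)
    frequent = {g for g, c in counts.items() if c >= min_count}
    result = {}
    for n in range(2, max_n + 1):
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                gram = tuple(s[i:i + n])
                if gram[:-1] in frequent and gram[1:] in frequent:
                    counts[gram] += 1
        frequent = {g for g, c in counts.items() if c >= min_count}
        if not frequent:
            break
        result.update({g: counts[g] for g in frequent})
    return result


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune(candidates, vectors, threshold=0.5):
    """Assumed rule: keep a candidate if all adjacent word pairs are similar."""
    kept = []
    for gram in candidates:
        if any(w not in vectors for w in gram):
            continue  # no vector available, so the candidate cannot be scored
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append(gram)
    return sorted(kept)


# Toy pre-segmented corpus: the segmenter has split the unknown word 区块链.
sentences = [
    ["区块", "链", "技术", "的", "发展"],
    ["区块", "链", "的", "应用"],
    ["技术", "的", "应用"],
]
# Hand-made 2-d vectors standing in for trained Skip-gram embeddings.
vectors = {
    "区块": [1.0, 0.1],
    "链": [0.9, 0.2],
    "技术": [0.2, 1.0],
    "的": [-1.0, 0.3],
    "应用": [0.3, 0.9],
}

candidates = frequent_ngrams(sentences)
new_words = prune(candidates, vectors)
print(new_words)  # [('区块', '链')]
```

On this toy input, mining yields three frequent bigrams, and pruning discards the two that contain the function word 的 because its vector points away from its neighbours. In practice the vectors would come from a trained Skip-gram model, and the threshold would need tuning on real data.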
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1



