
Recognition and Classification of Chinese Internet-Derived Entities Based on Deep Learning

Published: 2019-04-16 08:40
[Abstract]: With the explosion of online content, the Internet is flooded with non-standard Chinese expressions such as near-homophones, abbreviations, and synonyms. Because Chinese is flexible in how it is composed and used, many subject terms in text are expressed through derived words of these forms; such subject terms are called Internet-derived entities. Chinese Internet-derived entities are complex, changeable, and hard to recognize, and they are often substituted for the original words to evade government supervision of online public opinion. They therefore create many difficulties for natural language processing and public-opinion monitoring. Although researchers in China and abroad have studied the recognition of specific categories of derived entities extensively, no study has yet examined the overall data distribution of Internet-derived entities. Moreover, new derived entities keep emerging, which places new demands on recognition techniques.

The main work of this thesis is as follows.

1) For each category of derived entity, the thesis surveys and compares existing solutions from China and abroad and analyzes how the methods and techniques of mainstream recognition models have developed in recent years. By analyzing and summarizing these methods, it identifies their strengths and weaknesses in practical applications. Based on the characteristics of the problem studied here, it proposes a deep-learning-based approach to recognizing Chinese Internet-derived entities.

2) The thesis proposes two neural network architectures for recognizing Chinese Internet-derived entities: a sliding-window method and a sentence-convolution method. Both address the problem that sentences of varying length cannot be fed directly into a neural network. word2vec is used to obtain the model's input vectors, and a stacked autoencoder encodes the hand-crafted feature vectors; together these form a composite input that further improves recognition. Particular activation functions and training algorithms are used to speed up training and further optimize the model structure.

3) Extensive comparative experiments are carried out on a purpose-built corpus. Because no open corpus is available, the corpus was crawled with the Scrapy framework (252.3 MB) and then annotated manually. A large number of derived-entity recognition tests were run on this corpus, and the models' results were compared across entity categories. The experiments show that both proposed architectures handle Internet-derived entity recognition effectively, with F1 scores of 78.6% and 76.9% respectively. Each has its own strengths across entity categories, and both outperform traditional models on this corpus. By studying how different parameters and methods affect the results, the thesis also derives more general parameter-tuning guidance for these models as a reference for other researchers.

In practice, the deep-learning-based neural network entity recognition model proposed in this thesis applies well to the task of recognizing Chinese Internet-derived entities. It achieves good recognition performance on all categories of derived entities at once and can meet the new demands of Chinese Internet-derived entity recognition in the big-data setting.
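The abstract does not spell out how the sliding-window method turns sentences of different lengths into fixed-size inputs, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes one window per token, centred on that token, with a padding symbol at the sentence boundaries; the name `sliding_windows`, the `<PAD>` symbol, and the default window size of 5 are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

PAD = "<PAD>"  # hypothetical padding symbol, not taken from the thesis


def sliding_windows(tokens: Sequence[str], size: int = 5, pad: str = PAD) -> List[List[str]]:
    """Return one fixed-length window per token, centred on that token.

    Sentence boundaries are filled with `pad`, so every window has exactly
    `size` entries no matter how long the sentence is.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]


# Two sentences of different lengths yield windows of the same width.
short = sliding_windows(["a", "b"], size=3)
long = sliding_windows(["a", "b", "c", "d"], size=3)
```

Under these assumptions, each window could then be mapped to embedding vectors (for example with word2vec, as the abstract mentions) and concatenated into a fixed-size network input, with the label predicted for the centre token.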
[Degree-granting institution]: Wuhan University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1


Article ID: 2458638


Article link: http://sikaile.net/shoufeilunwen/xixikjs/2458638.html

