Semantic Relation Mining of Unstructured Data in the Field of Ethnic Information Resources
Published: 2018-09-07 11:17
[Abstract]: Unstructured text data makes up an important part of ethnic information resources. Developing, using, and disseminating it effectively would help promote economic and social development and cultural exchange among ethnic groups. This thesis mines semantic relations from unstructured text in the field of ethnic information resources and studies the key problems that arise during mining. The main research contents are as follows:
1. Word segmentation. Text in the field of ethnic information resources is first coarsely segmented with a dictionary-based method that uses maximum string matching. A bidirectional maximum string matching algorithm identifies overlapping (intersection-type) ambiguities, and statistics gathered from a raw corpus of the field are used to resolve them. A Chinese word segmenter is implemented on the basis of these algorithms.
2. New word recognition. Because text in this field contains many domain-specific words, a large-scale domain corpus is used to recognize new words. To handle the massive multi-feature data this produces and the slow speed of the statistics, the thesis proposes an N-Gram-based multi-feature recognition algorithm for massive corpora under the MapReduce parallel computing model. The algorithm uses N-Gram to identify candidate words, takes the chi-square statistic, left and right entropy, and word frequency as features, parallelizes the feature computation, and applies rules to decide whether a candidate is a new word. New word recognition for the field of ethnic information is implemented on the basis of this algorithm.
3. Entity relation mining. Once the relevant named entities in the field have been identified, the relations between them are mined. Defining a complete entity relation schema in advance is difficult, and so is building a large-scale relation-annotated corpus, so the thesis applies an open information extraction method based on unsupervised learning and designs and implements a platform for mining relations between named entities in the field of ethnic information.
By mining semantic relations from unstructured data in the field of ethnic information resources, the work addresses the problem of ethnic resource management and service.
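The abstract names bidirectional maximum matching for spotting overlapping ambiguities, and frequency plus left and right entropy as new-word features. The sketch below illustrates those two ideas only and is not the thesis's implementation: every function name, the `max_len` parameter, the toy dictionary, and the rule "the span is ambiguous when the two scan directions disagree" are assumptions made for illustration. The chi-square feature, the corpus-based disambiguation, and the MapReduce parallelization are left out.

```python
import math
from collections import Counter, defaultdict


def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # a single character is the fallback
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def is_ambiguous(text, vocab, max_len=4):
    """Assumed heuristic: flag the span when the two directions disagree."""
    return forward_max_match(text, vocab, max_len) != backward_max_match(text, vocab, max_len)


def entropy(counts):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())


def ngram_features(sentences, n=2):
    """Return {ngram: (frequency, left_entropy, right_entropy)} for character n-grams."""
    freq, left, right = Counter(), defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            freq[gram] += 1
            if i > 0:
                left[gram][s[i - 1]] += 1  # character immediately before the n-gram
            if i + n < len(s):
                right[gram][s[i + n]] += 1  # character immediately after the n-gram
    return {g: (freq[g], entropy(left[g]), entropy(right[g])) for g in freq}


# Toy dictionary (hypothetical): the two directions split the same string differently.
vocab = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", vocab))
print(backward_max_match("研究生命起源", vocab))
print(is_ambiguous("研究生命起源", vocab))
```

With this toy dictionary the forward scan yields 研究生 / 命 / 起源 and the backward scan yields 研究 / 生命 / 起源, so the span is flagged as ambiguous. A candidate n-gram whose left and right entropies are both high occurs in many different contexts, which is the usual reason for treating entropy as evidence that the candidate is a standalone word.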
[Degree-granting institution]: Yunnan Normal University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1
Article ID: 2228106
[Citation]: Huang Peng; Semantic Relation Mining of Unstructured Data in the Field of Ethnic Information Resources [D]; Yunnan Normal University; 2016
Article link: http://sikaile.net/jingjilunwen/jiliangjingjilunwen/2228106.html