Research on Entity and Relation Extraction for Place-Name Ontologies

Published: 2018-08-01 13:54

[Abstract]: In recent years, emergencies have occurred frequently, and emergency management has become increasingly important. Emergency management involves fusing data from many sources, and how to supply the relevant data quickly and accurately is a problem that urgently needs study. With the development of the Internet, the volume of online data is growing exponentially, and this data contains much of the information that emergency management needs. Place-name information is the core anchor of emergency information. This thesis studies entity and relation extraction for a place-name ontology: it extracts place-name-related entities and the relations between them, laying a core foundation for extracting emergency data and giving it semantic structure.

Entity and relation extraction correspond to named entity recognition and relation extraction in natural language processing. The mainstream approaches today are rule-based methods and machine-learning-based methods. Depending on how entities and relations appear in the raw text, this thesis applies whichever of the two fits each case. Because no established corpus exists for extraction in the place-name domain, the thesis first defines an entity scheme and a relation scheme for place-name ontology extraction, then builds the corpora needed for entity and relation extraction according to the features the extraction process relies on, and describes the corpus construction process in detail.

Place-name ontology entities are classified by the patterns in which they occur in the raw text and are extracted either with rules or with a maximum entropy model. Extraction rules are summarized for four classes of entities; for the remaining classes, the features used for machine learning are analyzed, and a maximum entropy model trained on the annotated corpus performs the extraction. For relation extraction, the thesis first analyzes the characteristics of the relations and applies a feature-vector-based method using SVM. Based on the characteristics of the corpus, it also proposes a rule-based method for extracting place-name ontology relations. In addition, it formulates rules that derive implicit relations from relations already extracted, further enriching the place-name ontology relation base.

Finally, a platform for place-name ontology entity and relation extraction is designed and implemented, and the extracted data is applied in a working semantic place-name search engine. In practice, the extracted entity and relation data substantially improved the user experience, helping users obtain place-name-related data more conveniently, quickly, and accurately.
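The abstract does not state which rules are used to derive implicit relations. As one hypothetical illustration of the idea, the sketch below assumes a single transitive relation named `located_in` and derives every triple implied by chaining existing ones. The relation name, the function `infer_transitive`, and the sample triples are assumptions made for this example and are not taken from the thesis.

```python
from collections import defaultdict


def infer_transitive(triples, relation="located_in"):
    """Return triples implied by transitivity of `relation` that are not
    already present in `triples`.

    `triples` is an iterable of (head, relation, tail) tuples. Treating
    `relation` as transitive is an assumption of this sketch.
    """
    # Build an adjacency map restricted to the chosen relation.
    graph = defaultdict(set)
    for head, rel, tail in triples:
        if rel == relation:
            graph[head].add(tail)

    inferred = set()
    for start in list(graph):
        # Depth-first search collects everything reachable from `start`.
        stack = list(graph[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, ()))
        # Keep only relations that were not stated explicitly;
        # skip self-relations that a cycle in the data would produce.
        for node in seen:
            if node != start and node not in graph[start]:
                inferred.add((start, relation, node))
    return inferred


if __name__ == "__main__":
    # Illustrative data only.
    known = [
        ("Tianjin University", "located_in", "Nankai District"),
        ("Nankai District", "located_in", "Tianjin"),
        ("Tianjin", "located_in", "China"),
    ]
    for triple in sorted(infer_transitive(known)):
        print(triple)
```

With the three sample triples, the function returns three new ones: Nankai District in China, Tianjin University in Tianjin, and Tianjin University in China. A real rule set would likely cover several relation types and would need to decide which of them are safe to treat as transitive.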
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1



Article ID: 2157790
