

Chinese Entity Relation Extraction in Text Mining

Published: 2018-05-09 03:10

  Topics: relation extraction + entity recognition; Source: master's thesis, Beijing University of Posts and Telecommunications, 2013


[Abstract]: With the rapid development of science and technology, data, and network data in particular, is growing exponentially. Text data, an important part of network data, has received considerable attention. To meet the challenge posed by massive text data and to store, manage, and make use of it effectively, people urgently need automated tools that can quickly locate the information they actually need within massive information sources. Research on information extraction addresses exactly this problem. Information extraction automatically extracts specific information from unstructured or semi-structured text and stores it in a structured form (for example, a database or an XML document). An information extraction task generally comprises two closely related subtasks: named entity recognition and entity relation extraction. This thesis studies an entity relation extraction system based on network data, that is, the problem of how to obtain the relation between two named entities. The main work includes:
1. A scheme for collecting the base data is designed around the characteristics of network data. The scheme makes full use of those characteristics and of search engine functionality, and combines them with the overall properties of page structure, so that a large number of relevant web resources can be obtained simply and at low cost and their main body text extracted.
2. The current mainstream relation extraction methods are studied in depth and their strengths and weaknesses analyzed; on that basis, a relation extraction method is proposed. The method combines two kinds of features, the structural relations of the sentence and the properties of its words, and effectively extracts entity relations from sentences.
3. Building on the above, a prototype system covering the pipeline from network data collection to relation extraction is implemented. The system is based on a B/S (browser/server) architecture, implements the relation extraction algorithm proposed in the thesis, and provides a visualization module that displays relation extraction results directly in the browser. Experiments carried out with the system verify the effectiveness of the relation extraction algorithm.
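The abstract notes that extracted information is stored in a structured form such as a database or an XML document. As a rough illustration only, the sketch below shows one way a relation between two named entities could be written out as XML. The `Relation` class, its field names, the `employed_by` label, and the XML element names are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Relation:
    """One extracted relation between two named entities (hypothetical schema)."""
    head: str      # first named entity
    tail: str      # second named entity
    label: str     # relation type; the label set here is an assumption
    sentence: str  # source sentence in which the relation was found


def relations_to_xml(relations):
    """Serialize a list of Relation records into one XML string."""
    root = ET.Element("relations")
    for rel in relations:
        node = ET.SubElement(root, "relation", {"type": rel.label})
        ET.SubElement(node, "head").text = rel.head
        ET.SubElement(node, "tail").text = rel.tail
        ET.SubElement(node, "sentence").text = rel.sentence
    return ET.tostring(root, encoding="unicode")


# Made-up example record, for illustration only.
example = [Relation("Alice", "Acme Corp", "employed_by", "Alice works at Acme Corp.")]
xml_text = relations_to_xml(example)
print(xml_text)
```

A record layout like this keeps the source sentence next to each relation, which would let a browser-based display module show the evidence for every extracted pair; whether the prototype system stores its results this way is not stated in the abstract.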
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1864300.html

