Research on Web Entity Activity and Entity Relation Extraction
Published: 2018-08-28 10:53
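[Abstract]: With the rapid development of Internet technology, the Web has become an enormous data source holding massive amounts of data. How to integrate the valuable information on the Web efficiently, comprehensively, and accurately has become both a research focus and a difficult problem in data integration, information retrieval, natural language understanding, and related fields. Such integration would supply data to systems for market intelligence analysis, search, and intelligent question answering, enrich the knowledge bases of those systems, help refine the results of analysis and reasoning, and let search engines return more precise results. The first problem in integrating Web data is how to use information extraction to turn the unstructured and semi-structured data on the Web into machine-readable structured data.
Web data is large-scale, heterogeneous, autonomous, and distributed, and existing information extraction techniques cannot deliver efficient, comprehensive, and accurate data integration all at once. On the one hand, techniques aimed at large-scale, distributed Web data focus on efficiently extracting named entities, entity relations, and entity attributes (data objects), but they are restricted by the domain of the extraction targets, and their results are fairly simple and not rich in information. On the other hand, techniques aimed at highly heterogeneous and autonomous unstructured Web data focus on the accuracy of the results, and their efficiency falls short of what large-scale extraction requires.
This thesis studies Web information extraction. Its goal is to mine the valuable information in large-scale, heterogeneous Web data as fully as possible and to enrich what is extracted, while preserving the accuracy of the results. The Web holds a large amount of data describing the activities of entities, yet existing techniques neither characterize nor extract this special kind of information in detail. For large-scale Web data, existing relation extraction techniques mainly target binary relations and ignore their temporal validity, which makes the extracted relation instances less usable.
The thesis addresses these shortcomings of existing Web information extraction: valuable Web data is underused, the extracted content is not rich enough, and its usability is poor. The main work and contributions are summarized as follows.
1. A Web entity activity extraction method based on SVM and an extended conditional random field is proposed. It works across multiple domains and accurately extracts entity activities, a previously unexploited type of data, from Web sources.
Web entity activities are data on the Web that describe the behavior and activities of entities, a special data type that traditional information extraction has seldom considered on its own. The thesis first characterizes Web entity activities in detail and gives a formal definition of an entity activity based on case grammar. It then proposes an extraction method based on SVM and an extended conditional random field that accurately extracts the activity information of entities from the Web. To avoid the heavy work of manually labeling training data, a training data generation algorithm based on heuristic rules converts a semantic role labeling training set into one suited to Web entity activity extraction, which is used to train a support vector machine classifier and an extended conditional random field separately. During extraction, the classifier selects the valid sentences that contain entity activities. The extended conditional random field then models label frequency features and relation features, which a traditional conditional random field cannot use, and labels the information to be extracted from the natural language sentences, improving labeling accuracy. Experiments in multiple domains show that the method is well suited to Web entity activity extraction.
2. A bootstrapping method for extracting the temporal validity of Web entity relations is proposed. It addresses the missing time dimension in traditional relation extraction, enriches the extracted content, and makes the results more usable.
Traditional relation extraction mainly studies binary relations, and existing techniques assume that relation instances are independent of time, so the results lack a time dimension and are less usable. To address this, the thesis proposes a bootstrapping method for extracting the temporal validity of Web entity relations, which extracts all relation instances of a given relation type together with the temporal information of each instance. The method first remodels the ternary relation to be extracted, which consists of the two entities of a binary relation plus the temporal information of that relation: the entity relation is treated as a single fact dimension, so that fact and time form a new binary relation. A classical bootstrapping method for binary relation extraction is then used to extract the relation instances and their temporal information. Compared with the traditional bootstrapping process, the method introduces a Markov logic network (MLN) to soften the hard constraints of rules and templates and so raise recall, and an L1-norm model to select high-quality templates and so raise precision. Because the extraction targets are natural language sentences on the Web, the method also introduces semantic parsing to make full use of the dependency features in those sentences. Experiments show that the method extracts relation instances of a given relation type and their temporal information efficiently and accurately in multiple domains. Further experiments show that introducing the MLN, L1-norm template selection, and semantic parsing into the bootstrapping process each significantly improves the results.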
[Abstract]: With the rapid development of Internet technology, the Web has become a huge data source with massive data. How to integrate valuable information on the Web efficiently, comprehensively, and accurately, so as to provide data support for market intelligence analysis, search engines, and intelligent question answering systems, enrich the knowledge bases of such systems, help refine the results of analysis and reasoning, and let search engines return more accurate retrieval results, has become a hot and difficult topic in data integration, information retrieval, natural language understanding, and other fields. To integrate Web data, the primary problem is how to transform the unstructured and semi-structured data on the Web into machine-readable structured data through information extraction.
Web data is large-scale, heterogeneous, autonomous, and distributed. Existing information extraction technology cannot meet the needs of efficient, comprehensive, and accurate data integration at the same time. On the one hand, when facing large-scale, distributed Web data, existing technology aims to extract named entities, entity relations, and entity attributes (data objects) efficiently, but the extraction methods are limited by the domain of the extraction objects, the results are relatively simple, and the information content is not rich enough. On the other hand, when facing highly heterogeneous and autonomous unstructured Web data, existing technology aims at the accuracy of the results, and its extraction efficiency cannot meet the needs of large-scale information extraction.
This thesis is devoted to the study of Web information extraction technology. While guaranteeing the accuracy of the extraction results, the goal is to handle large-scale, heterogeneous Web data, fully mine the valuable information on the Web, and enrich the content of information extraction. There is a large amount of data describing entity behavior and activities on the Web, and existing information extraction technology fails to characterize and extract this special kind of information in detail. For large-scale Web data, existing relation extraction technology mainly takes binary relations as the extraction object and does not consider their timeliness, which leads to poor usability of the relation instances.
This thesis studies the problems that existing Web information extraction technology does not make full use of the valuable data on the Web, that the extraction results are not rich enough, and that their usability is poor. The main work and contributions are summarized as follows.
1. A Web entity activity extraction method based on SVM and an extended conditional random field is proposed, which can accurately extract entity activities, a previously unused data type, from Web data sources across multiple domains.
Web entity activity refers to data on the Web that describes the behavior and activities of entities. Traditional information extraction technology seldom considers this special data type on its own. First, Web entity activity is described in detail, and a formal definition of entity activity is proposed based on case grammar. A Web entity activity extraction method based on SVM and an extended conditional random field is then proposed, which can accurately extract entity activity information from the Web. To avoid the heavy work of manually labeling training data, a training data generation algorithm based on heuristic rules is proposed, which transforms a training data set for semantic role labeling into one suitable for Web entity activity extraction. A support vector machine (SVM) classifier and an extended conditional random field (CRF) are trained on it separately. In the extraction process, valid sentences containing entity activities are obtained by the classifier. The extended CRF then models label frequency features and relation features, which cannot be used in a traditional CRF, to label the information to be extracted from natural language sentences and improve labeling accuracy. Experiments in multiple domains show that the proposed method is well suited to Web entity activity extraction.
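The training data generation step can be illustrated with a minimal sketch. The abstract does not state the heuristic rules themselves, so everything below is an assumption made for illustration: the PropBank-style role names (`ARG0`, `V`, `ARG1`, `ARGM-TMP`, `ARGM-LOC`), the activity slot names, the BIO tagging scheme, and the helper names `srl_to_activity_tags` and `is_activity_sentence` are all hypothetical. The tag sequences would serve as training data for a sequence labeler, and the yes/no label as training data for the sentence classifier.

```python
from typing import List, Tuple

# Hypothetical mapping from PropBank-style semantic roles to activity slots.
# The thesis's actual rules are not given in the abstract.
ROLE_TO_SLOT = {
    "ARG0": "AGENT",
    "V": "ACTION",
    "ARG1": "OBJECT",
    "ARGM-TMP": "TIME",
    "ARGM-LOC": "LOCATION",
}

Span = Tuple[str, int, int]  # (role, start index, end index exclusive)


def srl_to_activity_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert role-labeled spans into one BIO activity tag per token."""
    tags = ["O"] * len(tokens)
    for role, start, end in spans:
        slot = ROLE_TO_SLOT.get(role)
        if slot is None:
            continue  # roles with no activity slot stay outside ("O")
        for i in range(start, end):
            tags[i] = ("B-" if i == start else "I-") + slot
    return tags


def is_activity_sentence(tags: List[str]) -> bool:
    """Assumed heuristic: a sentence is a positive classifier example
    only if it has both an agent and an action."""
    has_agent = any(t.endswith("-AGENT") for t in tags)
    has_action = any(t.endswith("-ACTION") for t in tags)
    return has_agent and has_action


tokens = ["Google", "acquired", "YouTube", "in", "2006"]
spans = [("ARG0", 0, 1), ("V", 1, 2), ("ARG1", 2, 3), ("ARGM-TMP", 3, 5)]
tags = srl_to_activity_tags(tokens, spans)
print(list(zip(tokens, tags)), is_activity_sentence(tags))
```

In this sketch a single role-labeled corpus yields both training sets, which is the point of the conversion: no extra manual annotation is needed.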
2. A bootstrapping method for extracting the timeliness information of Web entity relations is proposed, which addresses the missing time dimension in traditional relation extraction, enriches the extracted content, and enhances the usability of the extraction results.
Traditional relation extraction mainly focuses on binary relations, and existing extraction techniques assume that relation instances are time-independent, which leads to a missing time dimension and poor usability of the extraction results. To address this, a bootstrapping method for extracting the timeliness information of Web entity relations is proposed. The method can extract all relation instances of a given relation type together with the timeliness information of each instance. First, the ternary relation to be extracted, consisting of the two entities of the binary relation and the timeliness information of the relation, is remodeled: the entity relation is regarded as a single fact dimension, so that a new binary relation is formed. A classical bootstrapping method for binary relation extraction is then used to extract relation instances and their timeliness information. Compared with the traditional bootstrapping process, the method introduces a Markov logic network (MLN) to weaken the hard constraints of rules and templates and improve recall, and an L1-norm model to select high-quality templates and improve precision. Because the extraction targets are natural language sentences on the Web, semantic parsing is also introduced to make full use of the dependency features in those sentences. Experimental results show that the method can extract relation instances of a given relation type and their timeliness information efficiently and accurately in multiple domains. Finally, experiments show that introducing the MLN, L1-norm template selection, and semantic parsing into the bootstrapping process each significantly improves the extraction results.
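The remodeling and the bootstrapping loop can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis's implementation: the entity pair is folded into a single "fact" argument so that (fact, time) is an ordinary binary relation, time expressions are reduced to four-digit years, templates are whole sentences with placeholders, and a crude confidence threshold stands in for both the MLN soft constraints and the L1-norm template selection. The names `candidates`, `bootstrap`, and `min_conf`, and the example sentences, are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str]   # (entity1, entity2), treated as one "fact" argument
Pair = Tuple[Fact, str]  # remodeled binary relation: (fact, time expression)

YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")  # assumed: times are years


def candidates(sentence: str, facts: Set[Fact]) -> List[Tuple[Pair, str]]:
    """Pair each fact mentioned in the sentence with each year in it and
    return the pair together with the template that connects them."""
    out = []
    for e1, e2 in facts:
        if e1 in sentence and e2 in sentence:
            for m in YEAR.finditer(sentence):
                template = (sentence.replace(e1, "<E1>")
                            .replace(e2, "<E2>")
                            .replace(m.group(), "<T>"))
                out.append((((e1, e2), m.group()), template))
    return out


def bootstrap(sentences: List[str], facts: Set[Fact], seeds: Set[Pair],
              rounds: int = 3, min_conf: float = 0.5) -> Set[Pair]:
    known = set(seeds)
    for _ in range(rounds):
        stats: Dict[str, List[int]] = {}  # template -> [known hits, matches]
        matches = []
        for s in sentences:
            for pair, template in candidates(s, facts):
                matches.append((pair, template))
                counts = stats.setdefault(template, [0, 0])
                counts[1] += 1
                if pair in known:
                    counts[0] += 1
        # Stand-in for template selection: keep templates that mostly
        # match pairs already believed to be correct.
        good = {t for t, (hit, total) in stats.items()
                if hit / total >= min_conf}
        new = {pair for pair, template in matches if template in good}
        if new <= known:
            break  # nothing new was learned, stop early
        known |= new
    return known


sentences = ["Alice joined Acme in 2001.",
             "Bob joined Initech in 2005.",
             "Bob visited Initech in 2010."]
facts = {("Alice", "Acme"), ("Bob", "Initech")}
seeds = {(("Alice", "Acme"), "2001")}
print(sorted(bootstrap(sentences, facts, seeds)))
```

Starting from one seed, the "joined" template is kept and yields the pair for Bob and Initech in 2005, while the "visited" template matches no known pair and is discarded, so 2010 is not attached to the fact.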
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP311.13
Article ID: 2209178
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2209178.html