
Research on Web Entity Activity and Entity Relation Extraction

Published: 2018-08-28 10:53
[Abstract]: With the rapid development of Internet technology, the Web has become a huge data source holding massive amounts of data. How to integrate the valuable information on the Web efficiently, comprehensively, and accurately has become both a hot topic and a difficult problem in data integration, information retrieval, natural language understanding, and related fields. Such integration would provide data support for market intelligence analysis, search engines, and intelligent question answering systems, enrich the knowledge bases of those systems, help refine the results of analysis and reasoning, and let search engines return more precise results. The primary problem in integrating Web data is how to use information extraction to turn the unstructured and semi-structured data on the Web into machine-readable structured data.
Web data is large-scale, heterogeneous, autonomous, and distributed, and existing information extraction techniques cannot meet the requirements of efficient, comprehensive, and accurate data integration at the same time. On the one hand, when facing large-scale, distributed Web data, existing techniques aim to extract named entities, entity relations, and entity attributes (data objects) efficiently, but the extraction methods are limited by the domain of the extracted objects, the results are relatively simple, and the information content is not rich enough. On the other hand, when facing highly heterogeneous and autonomous unstructured Web data, existing techniques aim at accurate results, and their efficiency cannot meet the needs of large-scale extraction.
This thesis studies Web information extraction. The goal is to fully mine the valuable information in large-scale, heterogeneous Web data and enrich the content of information extraction while preserving the accuracy of the results. The Web holds a large amount of data describing the behavior and activities of entities, yet existing techniques fail to characterize and extract this special kind of information, entity activity, in detail. For large-scale Web data, existing relation extraction techniques mainly target binary relations and do not consider their temporal validity, which makes the extracted relation instances less usable.
This thesis addresses the problems that existing Web information extraction techniques do not make full use of the valuable data on the Web and that their results are not rich enough and have poor usability. The main work and contributions are summarized as follows:
1. A Web entity activity extraction method based on SVM and an extended conditional random field is proposed. It works across multiple domains and accurately extracts entity activity, a previously unexploited data type, from Web data sources.
Web entity activity refers to data on the Web that describes the behavior and activities of entities. Traditional information extraction techniques seldom consider this special data type on its own. This thesis first characterizes Web entity activity in detail and gives a formal definition of entity activity based on case grammar. It then proposes a Web entity activity extraction method based on a support vector machine (SVM) and an extended conditional random field (CRF) that accurately extracts entity activity information from the Web. To avoid the heavy work of manually labeling training data, a training data generation algorithm based on heuristic rules converts a semantic role labeling training set into one suitable for Web entity activity extraction, which is used to train the SVM classifier and the extended CRF separately. During extraction, the classifier first selects the valid sentences that contain entity activities. The extended CRF, which models label frequency features and relation features that a traditional CRF cannot exploit, then labels the information to be extracted in those natural-language sentences, improving labeling accuracy. Experiments in multiple domains show that the method is well suited to Web entity activity extraction.
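The training data generation step can be pictured with a small sketch. The abstract does not list the heuristic rules, so everything below is an assumption made for illustration: the PropBank-style role names (`A0`, `V`, `A1`, `AM-TMP`, `AM-LOC`), the activity label set, the BIO encoding, and the rule that a sentence counts as an activity sentence when it has both an agent and an action. It shows only how one semantic role labeling annotation might be turned into a token-level training example for the sequence labeler and a positive or negative example for the sentence classifier.

```python
# Hypothetical mapping from PropBank-style SRL roles to activity labels.
# The thesis abstract does not give its rules; this table is an assumption.
ROLE_TO_ACTIVITY_LABEL = {
    "A0": "AGENT",        # who performs the activity
    "V": "ACTION",        # the predicate describing the activity
    "A1": "OBJECT",       # what the activity acts on
    "AM-TMP": "TIME",     # when the activity happens
    "AM-LOC": "LOCATION", # where the activity happens
}


def srl_to_activity_example(tokens, srl_spans):
    """Convert one SRL-annotated sentence into (token, BIO label) pairs.

    srl_spans is a list of (role, start, end) tuples with end exclusive.
    Roles missing from the mapping are ignored, so their tokens stay "O".
    """
    labels = ["O"] * len(tokens)
    for role, start, end in srl_spans:
        target = ROLE_TO_ACTIVITY_LABEL.get(role)
        if target is None:
            continue
        for i in range(start, end):
            prefix = "B" if i == start else "I"
            labels[i] = f"{prefix}-{target}"
    return list(zip(tokens, labels))


def is_activity_sentence(labeled):
    """Assumed rule for the sentence classifier's training label:
    the sentence must contain at least an agent and an action."""
    tags = {label.split("-", 1)[1] for _, label in labeled if label != "O"}
    return {"AGENT", "ACTION"} <= tags
```

In a pipeline of this shape, the output of `is_activity_sentence` would label sentences for the SVM filter and the BIO pairs would train the sequence labeler. The extended CRF itself, with its label frequency and relation features, is not reproduced here.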
2. A bootstrapping method for extracting the temporal validity information of Web entity relations is proposed. It effectively addresses the missing time dimension in traditional relation extraction, enriches the extracted content, and makes the results more usable.
Traditional relation extraction mainly studies binary relations, but existing techniques all assume that relation instances are independent of time, so their results lack a time dimension and have poor usability. To address this, the thesis proposes a bootstrapping method that extracts all relation instances of a given relation type together with the temporal validity information of each instance. The method first remodels the ternary relation to be extracted, which consists of the two entities of the binary relation and the temporal validity information of the relation: the entity relation is treated as a single fact dimension, so that the fact and its temporal information form a new binary relation. A classical bootstrapping binary relation extraction method then extracts the relation instances and their temporal information. Compared with the traditional bootstrapping process, the thesis introduces a Markov logic network (MLN) to soften the hard constraints of rules and patterns and improve recall, and uses an L1-norm model to select high-quality patterns and improve accuracy. Because the extraction targets are natural-language sentences on the Web, the method also introduces semantic parsing to make full use of the dependency features in those sentences. Experiments show that the method efficiently and accurately extracts relation instances of a given type and their temporal validity information across multiple domains, and that introducing the MLN, L1-norm pattern selection, and semantic parsing into the bootstrapping process each contributes significantly to the extraction results.
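The remodeling idea, folding the two entities and their relation into one fact so that (fact, time) becomes a binary relation, can be sketched with a toy bootstrapping loop. This is a minimal illustration under stated assumptions, not the thesis's method: the `Fact` type, the surface-string patterns, and the function names are hypothetical, and the sketch treats every induced pattern as a hard constraint. It leaves out the MLN, the L1-norm pattern selection, and the semantic parsing that the thesis adds.

```python
import re
from typing import NamedTuple


class Fact(NamedTuple):
    """An entity relation folded into a single 'fact' dimension."""
    subject: str
    relation: str
    obj: str


def induce_pattern(sentence, fact, time):
    """Turn a sentence mentioning a known (fact, time) pair into a regex
    with named slots. Returns None if any mention is missing."""
    if not all(m in sentence for m in (fact.subject, fact.obj, time)):
        return None
    pattern = re.escape(sentence)
    for slot, mention in (("subj", fact.subject), ("obj", fact.obj), ("time", time)):
        pattern = pattern.replace(re.escape(mention), f"(?P<{slot}>.+?)", 1)
    return "^" + pattern + "$"


def bootstrap(sentences, relation, seeds, rounds=2):
    """Expand a seed set of (Fact, time) pairs by alternating between
    pattern induction and pattern application."""
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: induce patterns from sentences that mention a known pair.
        for fact, time in known:
            for sentence in sentences:
                pattern = induce_pattern(sentence, fact, time)
                if pattern:
                    patterns.add(pattern)
        # Step 2: apply every pattern to every sentence to harvest new pairs.
        for pattern in patterns:
            for sentence in sentences:
                match = re.match(pattern, sentence)
                if match:
                    fact = Fact(match["subj"], relation, match["obj"])
                    known.add((fact, match["time"]))
    return known
```

Given made-up sentences such as "Alice joined Acme in 2007." and "Bob joined Initech in 2010." with the first as a seed, the loop learns the pattern "X joined Y in T." and recovers the second pair. Exact string patterns like these are brittle, which is the kind of hard constraint the thesis reports softening with the MLN.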
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP311.13


Article ID: 2209178


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2209178.html

