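Research on Several Key Problems in Web Information Extraction
Published: 2018-04-26 13:19
Topic: information extraction + named entity disambiguation; Reference: University of Science and Technology of China, master's thesis, 2015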
[Abstract]: In recent years, with the rapid development of Web applications, the information resources available on the Internet have become increasingly abundant. Against this background, Web information extraction technology has emerged. Web information extraction is a processing technique for accurately obtaining the factual information users need from massive amounts of data. It involves many problems, including entity recognition and extraction, relation extraction, entity disambiguation, opinion mining, and sentiment orientation analysis, and it has become one of the research hotspots in the Web field.

This thesis studies two key problems in Web information extraction: named entity disambiguation and sentiment-oriented information extraction. Named entity disambiguation aims to resolve the ambiguity of which concept a named entity on the Web refers to, and thereby to determine the entity it correctly denotes. In the Web environment, a single named entity mention can correspond to multiple entity concepts; for example, the mention "Washington" can refer either to the U.S. president George Washington or to the capital, Washington, D.C. Named entity disambiguation is therefore valuable in applications such as Web question answering, information retrieval, and machine translation. Sentiment-oriented information extraction focuses on mining opinion information from massive unstructured Web data and then analyzing the sentiment orientation of publishers toward the information they publish. It has many applications in modern life: for example, it can help enterprises accurately capture users' evaluations of products in order to optimize marketing strategies, and it can give government departments a basis for decision-making in public opinion monitoring and emergency response.

This thesis carries out algorithm design and experimental validation targeting the main challenges in named entity disambiguation and sentiment-oriented information extraction. Its main contributions can be summarized as follows:

(1) A Wikipedia-based named entity disambiguation method is proposed, which performs disambiguation through entity mention recognition, candidate entity library construction, and named entity matching. Within this method, we propose a new way of computing the similarity between the mention to be disambiguated and its candidate entities, and we use Wikipedia pages to mine the semantic associations between entities and between mentions and candidate entities. The effectiveness of the method is verified on the WISE Challenge2013 dataset.

(2) An algorithm for recognizing sentiment evaluation units based on syntactic dependency relations and SVM is proposed. A sentiment evaluation unit appears in a sentiment sentence as the pairing of a sentiment word with the evaluation target it modifies, and it directly determines the sentiment orientation of that sentence. The algorithm first extracts all possible candidate units by simple pattern matching and then filters the candidate set with an SVM model. For the classification step, we propose a method that uses syntactic dependency relations to automatically construct a large-scale training set, which improves the efficiency of training the classifier. Experiments on real-world datasets show that the algorithm clearly improves on previous algorithms in both precision and recall.
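The candidate-ranking step of contribution (1) can be illustrated with a minimal sketch. The abstract does not specify the thesis's similarity measure, so plain cosine similarity over token counts is swapped in here as a stand-in. The `CANDIDATES` table, its description strings, and the function names `tokenize`, `cosine`, and `disambiguate` are all hypothetical and exist only for illustration; a real candidate library would be built from Wikipedia pages.

```python
import math
import re
from collections import Counter

# Hypothetical candidate library: mention -> {entity title: short description}.
# The descriptions are toy text, not actual Wikipedia content.
CANDIDATES = {
    "Washington": {
        "George Washington": "first president of the United States and army general",
        "Washington, D.C.": "capital city of the United States on the Potomac river",
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(mention, context):
    """Return (score, entity title) pairs for `mention`, best match first.

    Assumption: similarity is bag-of-words cosine between the mention's
    context and each candidate description, not the thesis's own measure.
    """
    ctx = Counter(tokenize(context))
    scored = [
        (cosine(ctx, Counter(tokenize(desc))), title)
        for title, desc in CANDIDATES.get(mention, {}).items()
    ]
    return sorted(scored, reverse=True)


ranking = disambiguate(
    "Washington", "Washington served as the first president and led the army"
)
print(ranking[0][1])
```

With the toy data above, a context about a president and an army ranks the person first, while a context about a capital city on a river ranks the place first. A mention missing from the candidate library yields an empty ranking.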
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.1
Article ID: 1806186
Article link: http://sikaile.net/guanlilunwen/yingxiaoguanlilunwen/1806186.html