Research on Several Key Problems in Web Information Extraction

Published: 2018-04-26 13:19

Topic: information extraction + named entity disambiguation. Source: University of Science and Technology of China, master's thesis, 2015

[Abstract]: In recent years, with the rapid development of Web applications, the information resources on the Internet have become increasingly abundant. Against this background, Web information extraction technology has emerged. Web information extraction is a processing technology for accurately obtaining the factual information that users need from massive data. It involves many problems, including entity recognition and extraction, relation extraction, entity disambiguation, opinion mining, and sentiment orientation analysis, and it has become one of the research hotspots in the Web field. This thesis studies two key problems in Web information extraction: named entity disambiguation and sentiment information extraction. Named entity disambiguation aims to eliminate the ambiguity in the concept that a named entity on the Web refers to, so as to determine the entity it correctly denotes. In the Web environment, one named entity mention can correspond to several entity concepts; for example, the mention "Washington" can refer either to U.S. President George Washington or to the capital, Washington, D.C. Named entity disambiguation therefore has important application value in Web question answering systems, information retrieval, machine translation, and other fields. Sentiment information extraction focuses on mining opinion information from massive unstructured Web data and then analyzing the sentiment orientation of the publisher toward the published information. It has many applications in modern life: for example, it can help enterprises accurately obtain users' evaluations of products in order to optimize marketing strategies, and it can give government departments a basis for decision-making in public opinion monitoring and emergency handling. This thesis carries out algorithm design and experimental verification aimed at the main challenges in named entity disambiguation and sentiment information extraction.
The main contributions of the thesis can be summarized as follows: (1) A named entity disambiguation method based on Wikipedia is proposed, which performs disambiguation through entity mention recognition, candidate entity library construction, and named entity matching. Within this method, we propose a new way to compute the similarity between the entity mention to be disambiguated and each candidate entity, and we use Wikipedia pages to mine the semantic associations between entities and between entity mentions and candidate entities. The effectiveness of the method is verified on the WISE Challenge2013 dataset. (2) An algorithm for recognizing sentiment evaluation units based on syntactic dependency relations and SVM is proposed. A sentiment evaluation unit appears in a sentiment sentence as the pairing of a sentiment word with the evaluation target it modifies, and it directly determines the sentiment orientation of the sentence. The proposed algorithm first extracts all possible candidate sentiment evaluation units by simple pattern matching, and then filters the candidate set with an SVM model. For the classification step, we propose a method that automatically constructs a large-scale training set based on syntactic dependency relations, which improves the efficiency of training the classification model. Experiments on real datasets show that the algorithm clearly improves on earlier algorithms in both precision and recall.
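The abstract does not specify the similarity measure used in contribution (1), so the following is only a minimal sketch of the matching step, under stated assumptions: the candidate entity library is a small hand-written dictionary (`CANDIDATES`, toy data standing in for entries built from Wikipedia), and similarity is plain bag-of-words cosine between the mention's context and each candidate's description. The names `CANDIDATES`, `bag_of_words`, `cosine`, and `disambiguate` are illustrative and do not come from the thesis.

```python
import math
import re
from collections import Counter

# Hypothetical candidate entity library: mention -> {entity title: description}.
# Toy data; the thesis builds its library from Wikipedia pages.
CANDIDATES = {
    "washington": {
        "George Washington": (
            "first president of the united states general american revolution"
        ),
        "Washington, D.C.": (
            "capital city of the united states federal district congress"
        ),
    }
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(mention, context):
    """Return the candidate entity most similar to the mention's context.

    Returns None when the mention has no entry in the candidate library.
    """
    candidates = CANDIDATES.get(mention.lower(), {})
    if not candidates:
        return None
    ctx = bag_of_words(context)
    scores = {
        title: cosine(ctx, bag_of_words(description))
        for title, description in candidates.items()
    }
    return max(scores, key=scores.get)


city = disambiguate(
    "Washington", "Congress meets in Washington, the capital of the country"
)
person = disambiguate(
    "Washington",
    "Washington was elected president after leading the army as general",
)
```

In this sketch the context around the mention decides the outcome: words such as "capital" and "congress" pull the first sentence toward the city, while "president" and "general" pull the second toward the person. A measure that also uses link structure or relations between entities, as the abstract suggests, would replace `cosine` here.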
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.1



