Research on Entity Search and Entity Resolution Methods
Topic: entity search + entity resolution; Source: Lanzhou University, doctoral dissertation, 2012
[Abstract]: Searching unstructured and semi-structured data quickly and accurately for entities (for example people, organizations, products, and drugs) and their related information has become key to many applications, including information retrieval, recommender systems, and social networks. Research in recent years shows that entity-related searches account for a large share of Internet queries, and that this share keeps rising. Compared with single characters or fixed-length phrases, entities describe the semantics of a text more accurately and so help users grasp its core content quickly.

However, as Internet data keeps growing, information retrieval becomes harder, and the non-uniqueness (ambiguity) of entities in particular has become a pervasive problem. First, many different entities share exactly the same name. For example, more than 290,000 people in China are named "Zhang Wei"; when an entity name is entered in a query box, the top 100 pages returned by a search engine often refer to several different objects that share that name. Second, the same entity often appears under several forms (aliases) in different data sources. For example, the "People's Republic of China" is often called "China" or "P.R.C", and Liu Xiang has been called the "Asian Flying Man". In the pharmaceutical industry, the problems of one drug having many names and one name covering many drugs are also serious, and non-unique drug names are a major obstacle to correct medication. These two problems are, respectively, entity name ambiguity (homonymy) and entity alias identification. Their solutions work in opposite directions yet are closely related, and they are the two most important problems in entity search and resolution.

This thesis surveys work on entity search extensively and analyzes the characteristics of data from different sources, including the surface web, social networks, and enterprise intranets. It proposes effective solutions for the entity name ambiguity problem and the entity alias problem. In addition, based on the proposed name disambiguation solution, we developed a person search system, and we extended the proposed alias discovery solution to make it suitable for dynamic data environments. Throughout, we focus on analyzing unstructured text: we use natural language processing methods to explore the structural and content features of the words, entities, and sentences in a text, and use data mining algorithms to link this information, in order to solve the problems encountered in entity search and entity resolution. The main contributions of this thesis are as follows:
1. Survey of entity search. We introduce the problems encountered in entity search and the techniques used to address them, and briefly describe existing person-name search systems, open problems in person-name search, and directions for future research.
2. Entity name disambiguation. Taking person-name disambiguation as the example, we use natural language processing tools to extract named entities from the unstructured documents returned by a search engine, use the extracted entities as person tags, and build a graph structure over these entity tags. Finally, each of the different people who share a name is assigned entity tags that describe that person uniquely. In addition, the person-name search system we developed submits a given name as the query to an existing search engine (Google, Yahoo, or Bing) and applies our disambiguation method to the returned results, so that users can clearly see the key entity information for each of the different people who bear the queried name.
3. Entity alias discovery. We study two cases separately: aliases that are string-similar to the entity and aliases that are not. For the first case, we first extract alias candidates based on character similarity and then build an entity-relationship graph to select aliases. The case in which the alias and the original entity share essentially no character similarity is more challenging. For it we propose filtering alias candidates with a method based on entity subset partitioning, and then determining the final aliases of a given entity with an active-learning classifier. Overall, the alias discovery methods explore the relationships among entities in a given dataset, apply an initial filter to extract alias candidates for a given entity, then use unsupervised or supervised methods to assess the correlation between the entity and its candidates, and finally output a list of aliases for each given entity.
4. Dynamic entity alias discovery. As new data is added to a given dataset, the entity-relationship graph built on it must be updated accordingly (insertion, deletion, and modification of vertices and edges), and earlier static solutions no longer suit such a dynamic environment. We therefore propose a path search method based on an entity index to update the dynamic graph, and apply this dynamic scheme to incremental entity alias discovery.
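The abstract does not spell out the disambiguation algorithm, so the following is only a rough sketch of the general idea in contribution 2, not the thesis's actual method. It assumes each returned page has already been reduced to a set of extracted named entities, links two pages when they share at least `min_shared` entities, and treats each connected component of that graph as one person. The function names, the `min_shared` threshold, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_entities(doc_entities, min_shared=2):
    """Group pages about the same name into likely distinct people.

    doc_entities maps a page id to the set of entities extracted from it
    (assumed input; any NER tool could produce it). Pages sharing at least
    min_shared entities are linked, and connected components are returned.
    """
    docs = sorted(doc_entities)
    parent = {d: d for d in docs}

    def find(d):
        # Union-find lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if len(doc_entities[a] & doc_entities[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for d in docs:
        clusters[find(d)].append(d)
    return sorted(clusters.values())


def describe(cluster, doc_entities, top_k=3):
    """Return the top_k most frequent entities in a cluster as its tags."""
    counts = defaultdict(int)
    for d in cluster:
        for entity in doc_entities[d]:
            counts[entity] += 1
    # Most frequent first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda e: (-counts[e], e))[:top_k]


# Made-up pages returned for one ambiguous name (illustrative only).
pages = {
    "p1": {"Lanzhou University", "data mining", "Gansu"},
    "p2": {"Lanzhou University", "data mining", "ACM"},
    "p3": {"football", "Shanghai", "striker"},
    "p4": {"football", "striker", "Chinese Super League"},
}

clusters = cluster_by_shared_entities(pages)
print(clusters)  # [['p1', 'p2'], ['p3', 'p4']]
for cluster in clusters:
    print(cluster, describe(cluster, pages, top_k=2))
```

The pairwise comparison is quadratic in the number of pages, which is acceptable for the roughly 100 results per query mentioned in the abstract but would need an inverted index from entity to page for larger collections.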
[Degree-granting institution]: Lanzhou University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP391.3
Article ID: 2058714
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2058714.html