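Research on a Person Name Disambiguation Algorithm Based on Two-Stage Clustering
Published: 2018-02-27 19:17
Keywords: person name disambiguation; attribute extraction; semantic relation graph; clustering. Source: Northeastern University (東北大學), master's thesis, 2012. Document type: degree thesis.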
[Abstract]: With the spread of the Internet, submitting queries to a search engine has become the main way people obtain information online. Person name search is among the most common kinds of query, and a search engine makes it easy to find information about a person. Because shared names are so common, however, a query for a single name often returns a long result list covering many people who share that name. To find a specific person, users must refine the query with additional features or browse the result list and pick the intended person out from among the namesakes, which greatly reduces search efficiency. An effective person name disambiguation algorithm is therefore needed to improve the efficiency of person name retrieval.
Building on an analysis of existing theory and techniques for person name disambiguation, this thesis proposes a two-stage clustering method. Person attributes are an important feature for name disambiguation. First, the thesis extracts 16 main person attributes: the 9 that are relatively easy to extract are handled with traditional regular-expression patterns and dictionary matching, while the 7 that are harder to extract are handled with an automatic extraction method based on self-expansion. Next, each result document returned by the search engine is represented as an attribute vector, and the similarity between documents is computed. Finally, an initial clustering is performed. Because not every web page contains person attribute information, many pages without such information cannot be clustered correctly in this initial stage. The thesis therefore proposes a second clustering pass that uses semantic relations. First, concepts and the semantic relations between them are extracted from Wikipedia, the relations are quantified, and a semantic relation graph is built. Second, the SimRank algorithm is used to compute the similarity between any two nodes. Then the initial clusters are represented as vectors of Wikipedia concepts. Finally, the similarity between clusters is computed from the semantic relations among concepts, and the second round of name clustering is performed.
Experimental results show that the proposed two-stage clustering algorithm for person name disambiguation significantly improves both precision and recall and outperforms previous methods, which indicates that the algorithm is effective for the person name disambiguation problem.
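The SimRank step mentioned in the abstract can be illustrated with a small sketch. It implements the standard, unweighted SimRank recurrence, in which two nodes are similar when their in-neighbors are similar. The abstract does not say how the thesis weights its semantic relations, so the decay factor, iteration count, function name `simrank`, and the toy graph below are all assumptions made for illustration, not the author's implementation.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Plain (unweighted) SimRank.

    in_neighbors maps every node to the list of nodes that point to it.
    Returns a dict mapping (a, b) node pairs to a similarity in [0, 1].
    decay and iterations are illustrative defaults, not values from the thesis.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}

    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, nodes):
            if a == b:
                updated[(a, b)] = 1.0
                continue
            ins_a, ins_b = in_neighbors[a], in_neighbors[b]
            if not ins_a or not ins_b:
                # A node with no in-neighbors has no evidence of similarity.
                updated[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, scaled by decay.
            total = sum(sim[(x, y)] for x in ins_a for y in ins_b)
            updated[(a, b)] = decay * total / (len(ins_a) * len(ins_b))
        sim = updated
    return sim


# Hypothetical toy concept graph: one concept links to two others.
graph = {
    "university": [],
    "researcher_a": ["university"],
    "researcher_b": ["university"],
}
scores = simrank(graph)
print(scores[("researcher_a", "researcher_b")])
```

In this toy graph the two researcher nodes share their only in-neighbor, so their score equals the decay factor, while a node with no in-neighbors scores zero against every other node. A second-stage clustering like the one described above would then compare clusters through such pairwise concept scores.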
[Degree-granting institution]: Northeastern University (東北大學)
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 1543975
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1543975.html