Research on Related Entity Finding and Homepage Finding
Published: 2018-06-06 21:11
Topic: TREC + REF; Source: Beijing University of Posts and Telecommunications, master's thesis, 2013
[Abstract]: REF (Related Entity Finding) is a very promising research topic within entity retrieval at TREC (Text Retrieval Conference). Research on it stands to change substantially both search engines and the way people process information on the web. Given the information provided in a topic, the REF task requires extracting, from the Internet and related databases, the related entities that answer the topic together with each entity's homepage. This thesis surveys the state of the art at home and abroad and a number of leading algorithms, and analyzes in turn keyword extraction and expansion, text retrieval, paragraph segmentation and relevance computation, named entity recognition, entity ranking, and supporting document retrieval. The improvements and innovations in the implementation are as follows:
(1) Earlier approaches processed the full text of each web page. This work adds processing at the level of short text, that is, paragraphs, which removes a large amount of irrelevant text, reduces the size of the returned text, and improves the processing efficiency of the system.
(2) Drawing on the structure of Wikipedia, the work uses Wikipedia synonyms and hypernyms to build a Wikipedia-based category dictionary and applies it in the entity extraction stage. This accommodates the many fine-grained entity types in this year's REF task and also improves the accuracy of entity extraction.
(3) An algorithm based on word density is added to check the results of the DCM (document-centric model), with fairly good results. In addition, the parameters in the DCM formula are adjusted according to last year's answers, which improves the model.
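The abstract does not give the formulas behind the paragraph-level processing in (1) or the word-density check in (3). As an illustration only, the sketch below shows one simple way such a step could look: a page is split into paragraphs, each paragraph is scored by the fraction of its tokens that are topic keywords, and paragraphs below a threshold are discarded. The function names, the blank-line paragraph split, the density definition, and the threshold value are all assumptions made for this example and are not taken from the thesis.

```python
import re


def split_paragraphs(text):
    """Split page text into paragraphs on blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def keyword_density(paragraph, keywords):
    """Return the fraction of tokens in the paragraph that are topic keywords."""
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for t in tokens if t in wanted)
    return hits / len(tokens)


def filter_paragraphs(text, keywords, threshold=0.1):
    """Keep paragraphs whose keyword density reaches the threshold.

    Returns (score, paragraph) pairs sorted from highest to lowest score.
    The threshold of 0.1 is an arbitrary illustrative value.
    """
    scored = [(keyword_density(p, keywords), p) for p in split_paragraphs(text)]
    kept = [(s, p) for s, p in scored if s >= threshold]
    return sorted(kept, key=lambda sp: sp[0], reverse=True)


if __name__ == "__main__":
    page = (
        "Boeing 747 is operated by many airlines.\n\n"
        "Subscribe to our newsletter for weekly deals.\n\n"
        "Airlines that fly the Boeing 747 include Lufthansa."
    )
    for score, paragraph in filter_paragraphs(page, ["airlines", "boeing", "747"]):
        print(f"{score:.3f}  {paragraph}")
```

In this toy input the newsletter paragraph contains no topic keywords and is dropped, so later stages such as named entity recognition would only see the two relevant paragraphs. A real system would likely use a stronger relevance measure than raw keyword density.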
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3