Research on Named Entity Query Processing Techniques in Search Engines
Published: 2018-06-12 03:53
Topics: named entity + query segmentation; Source: Harbin Institute of Technology, 2012 doctoral dissertation
[Abstract]: The Internet has become an important platform for obtaining information and conducting transactions. With the rapid growth of data and application resources on the Internet, search engines have become an essential tool for retrieving information quickly and accurately from the vast body of online resources. Users express their information needs by submitting queries to a search engine, and the search engine returns the results the user needs based on its analysis of the query. The query is therefore the essential channel of information between the user and the search engine. For a search engine to understand accurately the information need expressed in a query, research on automatic query analysis and processing is required.
Named entity queries are an important class of queries. They account for a large proportion of search engine queries and have characteristics of their own. Studying processing techniques for named entity queries helps a search engine better analyze the user's search intent, return accurate results, and improve the search experience. Named entity query processing typically involves obtaining the semantic segments in a query, recognizing the entities the query contains, and analyzing the search intent of the query. Accordingly, this dissertation studies named entity query processing from the following four aspects.
1. Unsupervised automatic query segmentation based on a monolingual word alignment model. Query segmentation is a basic and necessary query processing task: it splits a query, given as a character sequence, into semantic units such as words or phrases. Because the vocabulary that appears in queries is very large and contains many non-standard words, supervised methods require large amounts of manually annotated training data and therefore do not adapt well to the query segmentation task. This dissertation proposes an unsupervised query segmentation method based on a monolingual word alignment model. The method trains the segmentation model automatically using only query logs, and the model combines character co-occurrence information, position information, and fertility information, which yields good segmentation results. Building on term-level segmentation, the dissertation further performs hierarchical segmentation, representing a query as a tree of segments. The hierarchical result shows which segments in the query are more closely related to one another. Experimental results show that the proposed method achieves better query segmentation than existing segmentation methods.
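The idea of hierarchical segmentation from query-log statistics can be sketched in a few lines. The sketch below does not implement the dissertation's monolingual word alignment model. It substitutes a much simpler stand-in, an add-one smoothed PMI-style association score between adjacent tokens, and builds the segment tree by repeatedly cutting at the weakest link. The query log, the function names, and the scoring formula are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized query log; a real log would be far larger.
QUERY_LOG = [
    ["new", "york", "hotels"],
    ["new", "york", "weather"],
    ["new", "york", "times"],
    ["cheap", "hotels"],
    ["cheap", "flights"],
    ["weather", "forecast"],
]


def train_counts(log):
    """Count tokens and adjacent token pairs over the whole log."""
    unigrams, bigrams = Counter(), Counter()
    for query in log:
        unigrams.update(query)
        bigrams.update(zip(query, query[1:]))
    return unigrams, bigrams


def association(left, right, unigrams, bigrams):
    """Add-one smoothed PMI-style score for two adjacent tokens (assumed scoring)."""
    total = sum(unigrams.values())
    pair = bigrams[(left, right)] + 1
    return math.log(pair * total / ((unigrams[left] + 1) * (unigrams[right] + 1)))


def segment_tree(tokens, unigrams, bigrams):
    """Cut at the weakest adjacent link, then recurse on both halves."""
    if len(tokens) == 1:
        return tokens[0]
    scores = [association(a, b, unigrams, bigrams)
              for a, b in zip(tokens, tokens[1:])]
    cut = scores.index(min(scores)) + 1
    return (segment_tree(tokens[:cut], unigrams, bigrams),
            segment_tree(tokens[cut:], unigrams, bigrams))


unigrams, bigrams = train_counts(QUERY_LOG)
tree = segment_tree(["cheap", "new", "york", "hotels"], unigrams, bigrams)
print(tree)  # ('cheap', (('new', 'york'), 'hotels'))
```

In the resulting tree, tokens that are merged deeper (here "new" and "york") are the ones the log statistics bind most tightly, which is the kind of relation the hierarchical segmentation is meant to expose.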
2. Named entity mining from query logs based on a random walk model on graphs. A query log is a data resource that contains a large number of named entities. Named entities mined from query logs better match the way users refer to entities when they construct queries. Query logs are also continually updated and record newly emerging entity names, which makes named entity mining from query logs of particular practical value for search engines that handle named entity queries. This dissertation adopts a weakly supervised method. A small number of named entity names belonging to the target category serve as seeds. A tripartite graph is built from the candidate named entities extracted from the query log, the context templates of named entities in queries, and the URLs clicked by users, and a random walk algorithm on this graph obtains the named entities of the target category. Experimental results show that the method effectively combines the named-entity-related information in query logs and improves the accuracy of the named entities obtained from them.
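A seed-driven walk over an entity, template, and URL graph can be sketched as follows. The abstract does not specify the walk, so this sketch assumes a random walk with restart at the seed entities, one common choice. The log records, node labels, restart probability, and function names are hypothetical, and edges are assumed to connect each entity to the templates and URLs it occurs with.

```python
from collections import defaultdict

# Hypothetical log records: (candidate entity, context template, clicked URL).
RECORDS = [
    ("inception", "# trailer", "movies.example/inception"),
    ("inception", "watch # online", "movies.example/inception"),
    ("titanic", "# trailer", "movies.example/titanic"),
    ("titanic", "watch # online", "stream.example/titanic"),
    ("avatar", "# trailer", "movies.example/avatar"),
    ("paris", "# weather", "weather.example/paris"),
    ("paris", "hotels in #", "travel.example/paris"),
]


def build_graph(records):
    """Undirected tripartite graph: entity (E), template (T), and URL (U) nodes."""
    graph = defaultdict(set)
    for entity, template, url in records:
        e, t, u = ("E", entity), ("T", template), ("U", url)
        graph[e].update((t, u))
        graph[t].add(e)
        graph[u].add(e)
    return graph


def random_walk_with_restart(graph, seeds, restart=0.15, iterations=50):
    """Iterate the walk; at each step the walker jumps back to a seed with probability `restart`."""
    nodes = list(graph)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * score[n] / len(graph[n])
            for neighbor in graph[n]:
                nxt[neighbor] += share
        score = nxt
    return score


def rank_entities(scores, seeds):
    """Non-seed entity nodes, highest score first."""
    candidates = [(node[1], s) for node, s in scores.items()
                  if node[0] == "E" and node not in seeds]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


graph = build_graph(RECORDS)
seeds = {("E", "inception")}
scores = random_walk_with_restart(graph, seeds)
ranking = rank_entities(scores, seeds)
print([name for name, _ in ranking])
```

With the movie title "inception" as the only seed, candidates that share context templates with it accumulate score, while "paris", which shares neither a template nor a URL with the seed, receives none.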
3. Acquisition of synonymous attribute phrases of named entities from online encyclopedias. Among the attribute phrases of a named entity, phrases that express the same attribute of the entity in different ways are called synonymous attribute phrases. Obtaining them helps in analyzing the search intent of named entity queries, because in such queries users usually build the query with an attribute phrase to express a need for the value of an entity attribute. This dissertation obtains the attribute phrases of named entities from online encyclopedias and uses a classification framework that combines multiple features to identify the synonymous attribute phrases among them. To the authors' knowledge, this is the first study to propose using online encyclopedias to obtain synonymous attribute phrases. Experimental results show that online encyclopedias are an effective resource for obtaining synonymous attribute phrases of entities and that the proposed method can effectively obtain a large number of them.
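Combining several features to decide whether two attribute phrases are synonymous can be sketched as below. The abstract does not list the features or the classifier, so everything here is an assumption: two illustrative features (overlap of the entity and value pairs each phrase is attached to, and word overlap between the phrases) and a hand-weighted linear rule standing in for a trained classifier. The records, weights, and threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical attribute records: (source, entity, attribute phrase, value).
RECORDS = [
    ("encyclopedia_a", "entity_1", "birth date", "1950-03-02"),
    ("encyclopedia_b", "entity_1", "date of birth", "1950-03-02"),
    ("encyclopedia_a", "entity_2", "birth date", "1962-11-30"),
    ("encyclopedia_b", "entity_2", "date of birth", "1962-11-30"),
    ("encyclopedia_a", "entity_1", "death date", "2001-07-15"),
    ("encyclopedia_b", "entity_2", "birthplace", "city_x"),
]

# Hand-set weights and threshold, standing in for a trained classifier.
WEIGHTS = {"value_overlap": 0.7, "word_overlap": 0.3}
THRESHOLD = 0.5


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(phrase_a, phrase_b, records):
    """Features for one candidate pair of attribute phrases."""
    facts = defaultdict(set)
    for _source, entity, attribute, value in records:
        facts[attribute].add((entity, value))
    return {
        "value_overlap": jaccard(facts[phrase_a], facts[phrase_b]),
        "word_overlap": jaccard(set(phrase_a.split()), set(phrase_b.split())),
    }


def is_synonymous(phrase_a, phrase_b, records):
    """Weighted sum of the features compared against the threshold."""
    features = pair_features(phrase_a, phrase_b, records)
    score = sum(WEIGHTS[name] * value for name, value in features.items())
    return score >= THRESHOLD


print(is_synonymous("birth date", "date of birth", RECORDS))  # True
print(is_synonymous("birth date", "death date", RECORDS))     # False
```

The value-overlap feature carries most of the weight in this sketch because "birth date" and "death date" share a word yet describe different attributes; agreement on the values attached to the same entities is what separates them.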
4. Search intent recognition for named entity queries. This part comprises two components: classification-based search intent recognition and finer-grained search intent recognition based on query retrieval patterns. Query intent classification restricts the category space of the search results and improves retrieval accuracy. The classification fuses information from multiple resources: effective intent classification features are obtained by analyzing the query text, the query log, and Web search results. Within the informational and transactional named entity queries identified by the intent classification model, the dissertation further extracts the query retrieval patterns that users frequently use and clusters the patterns that have similar search intent. Retrieval patterns can be matched against the queries users submit, which helps the search engine analyze their search intent accurately. The patterns are acquired by cascading a graph-model-based method and a similarity-based method. Experimental results show that the method effectively obtains query retrieval patterns across multiple entity categories.
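How clustered retrieval patterns might be matched against an incoming query can be sketched as follows. This covers only the matching step, not the dissertation's pattern acquisition or clustering. The entity list, the pattern clusters, the `#` slot notation, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical known entities and pattern clusters; "#" marks the entity slot.
ENTITIES = {"inception", "the matrix"}
PATTERN_CLUSTERS = {
    "watch": ["watch # online", "# full movie", "# streaming"],
    "reviews": ["# review", "# reviews", "is # good"],
}
PATTERN_TO_INTENT = {
    pattern: intent
    for intent, patterns in PATTERN_CLUSTERS.items()
    for pattern in patterns
}


def to_pattern(query, entities):
    """Replace the longest known entity in the query with the slot marker."""
    normalized = " ".join(query.lower().split())
    for entity in sorted(entities, key=len, reverse=True):
        regex = r"\b" + re.escape(entity) + r"\b"
        if re.search(regex, normalized):
            return re.sub(regex, "#", normalized, count=1)
    return None  # no known entity in the query


def match_intent(query):
    """Return the intent cluster whose pattern matches the query, if any."""
    return PATTERN_TO_INTENT.get(to_pattern(query, ENTITIES))


print(to_pattern("inception review", ENTITIES))  # # review
print(match_intent("Watch The Matrix online"))   # watch
print(match_intent("inception cast"))            # None
```

A query whose pattern falls outside every cluster, such as "inception cast" here, returns no intent, which in a fuller system would be the point to fall back on the coarser intent classifier.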
In summary, this dissertation studies several key techniques for named entity query processing. For the sake of broader applicability, some of these techniques are not limited to named entity queries and can also be applied to other queries. The research has yielded some preliminary conclusions and results, which the author hopes will benefit named entity query processing in search engines.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP391.3
Article ID: 2008216
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2008216.html