Distribution Scheduling and Fusion Ranking in a Distributed Search Engine System
Published: 2018-12-31 20:50
[Abstract]: With the growth of the Internet, the volume of web information has increased explosively. With limited funding and equipment, many retrieval systems can only index and search resources in a single field or on a single topic, and can hardly cover the whole web. Distributed information retrieval offers a solution: as a distributed architecture, it can make effective use of scattered idle resources to provide retrieval services.
Distributed information retrieval refers to the process of retrieving information useful to the user from large, heterogeneous information resources in a distributed environment, using techniques such as distributed computing and mobile agents. Because different information resources have different data storage structures and retrieval strategies, the key technical problems of a distributed search system include: how to describe the content of each resource and select resource nodes by comparing those descriptions with the query (the query distribution and node scheduling problem); and how to merge the document lists returned by different resource nodes (the result fusion and ranking problem).
This thesis presents the design and implementation details of the "Next-Generation Internet Distributed Search Engine System", studies the two problems above on the basis of this system, and gives solutions that are implemented in it. For distribution scheduling, it proposes first obtaining resource descriptions in two ways, feature words and random plus high-frequency word sampling, and then scoring and selecting resources by combining the resource descriptions with historical retrieval information. For fusion ranking, it proposes a similarity principle and a diversity principle based on application requirements, and combines the two into a fusion ranking strategy whose emphasis differs from that of earlier algorithms.
The two strategies were evaluated experimentally on the system, with comparative data and analysis for the system before and after the strategies were applied. The results show that the proposed distribution scheduling and fusion ranking strategies improve both the recall and the precision of the retrieval results while preserving their diversity.
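The abstract does not give the scoring or merging formulas, so the following is only a minimal sketch of the two ideas under stated assumptions, not the thesis's actual method. It assumes a resource description is a sampled term-frequency table, that history is summarized as a hit rate per node, and that similarity scores from different nodes are directly comparable. The names `Resource`, `score_resource`, `select_resources`, `merge_results`, `history_weight` and `diversity_penalty` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    term_freq: Counter      # sampled resource description: term -> frequency
    past_hits: int = 0      # assumed history: queries this node answered usefully
    past_queries: int = 0   # assumed history: queries sent to this node


def score_resource(query_terms, res, history_weight=0.3):
    """Blend description match with historical hit rate (assumed linear mix)."""
    total = sum(res.term_freq.values()) or 1
    desc_score = sum(res.term_freq[t] for t in query_terms) / total
    hist_score = res.past_hits / res.past_queries if res.past_queries else 0.0
    return (1 - history_weight) * desc_score + history_weight * hist_score


def select_resources(query_terms, resources, k=2):
    """Return the k highest-scoring resource nodes for this query."""
    ranked = sorted(resources, key=lambda r: score_resource(query_terms, r),
                    reverse=True)
    return ranked[:k]


def merge_results(result_lists, diversity_penalty=0.2):
    """Greedy merge of per-node result lists.

    result_lists maps node name -> [(doc_id, similarity), ...].
    Each step takes the best remaining document, where a node's top
    document is penalized by how many documents that node has already
    contributed (similarity first, diversity as a correction).
    """
    pools = {n: sorted(docs, key=lambda d: d[1], reverse=True)
             for n, docs in result_lists.items()}
    taken = Counter()
    merged = []
    while any(pools.values()):
        best = max((n for n in pools if pools[n]),
                   key=lambda n: pools[n][0][1] - diversity_penalty * taken[n])
        merged.append(pools[best].pop(0)[0])
        taken[best] += 1
    return merged


news = Resource("news", Counter({"election": 5, "sports": 5}), 8, 10)
edu = Resource("edu", Counter({"course": 9, "exam": 1}), 1, 10)
chosen = select_resources(["course"], [news, edu], k=1)

merged = merge_results({
    "a": [("a1", 0.9), ("a2", 0.85), ("a3", 0.8)],
    "b": [("b1", 0.7)],
})
print([r.name for r in chosen], merged)
```

In this toy run the `edu` node is chosen for the query "course" because its description matches strongly despite a weak history, and the merge places `b1` ahead of `a2` because node `a` has already contributed one document. In a real heterogeneous setting the per-node similarity scores would likely need normalization before being compared this way.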
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 2397094
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2397094.html