搜索引擎中索引剪枝的研究

發(fā)布時(shí)間：2018-07-26 10:39

【摘要】：搜索引擎作為人們獲取網(wǎng)絡(luò)信息的主要入口，正在被越來越多的人使用。不斷增長的網(wǎng)頁數(shù)量和查詢請求量使得搜索引擎面臨著巨大的性能挑戰(zhàn)。通常，搜索引擎每秒需要在數(shù)百億的網(wǎng)頁數(shù)據(jù)上處理成千上萬的查詢。因此，，如何高效地處理查詢一直是搜索引擎和信息檢索領(lǐng)域中重要的研究問題。本文從索引剪枝的角度出發(fā)來研究提升查詢處理效率的方法。索引剪枝通常分為靜態(tài)索引剪枝和動(dòng)態(tài)索引剪枝的方法。靜態(tài)索引剪枝方法主要用在索引構(gòu)建階段。它在索引構(gòu)建時(shí)，去除索引中一些對查詢不重要的信息來縮短倒排鏈長度，減小倒排索引的大小，從而提升查詢的速度。動(dòng)態(tài)索引剪枝的方法主要用在查詢的處理階段。它在查詢的處理時(shí)，通過使用相應(yīng)的索引輔助結(jié)構(gòu)來跳過一些對查詢不重要的文檔來提升查詢的處理速度。本文分別從靜態(tài)索引剪枝和動(dòng)態(tài)索引剪枝兩方面來研究提升查詢處理的方法，并提出了一些新的索引結(jié)構(gòu)和算法來支持高效的查詢處理。本文的工作、創(chuàng)新性和主要貢獻(xiàn)主要體現(xiàn)在以下幾個(gè)方面： 1.本文提出了一種基本網(wǎng)頁結(jié)構(gòu)和重要度的靜態(tài)索引剪枝方法。在這種剪枝方法中，它一方面利用網(wǎng)頁的結(jié)構(gòu)來區(qū)別對待在不同結(jié)構(gòu)中信息的重要程度，優(yōu)先保留重要結(jié)構(gòu)中的信息。另一方面利用網(wǎng)頁與網(wǎng)頁之間的重要度不同，對重要度高的網(wǎng)頁保留更多比例的信息。通過在GOV2數(shù)據(jù)集上的相關(guān)實(shí)驗(yàn)，我們發(fā)現(xiàn)網(wǎng)頁結(jié)構(gòu)和重要度信息對文檔的剪枝都有非常大的幫助。結(jié)合這兩種信息的靜態(tài)索引剪枝方法在去除80%的倒排項(xiàng)時(shí)，P@10指標(biāo)與使用完整索引的效果相當(dāng)，存儲(chǔ)開銷減小了73%，查詢處理的速度提升了2倍。 2.本文提出了一種基于塊最大索引的動(dòng)態(tài)索引剪枝算法的處理框架。在此框架下，本文提出了3個(gè)新的動(dòng)態(tài)索引剪枝算法，分別為Block-MaxMaxScore (BMM)， Local Block-Max WAND (LBMW)和Local Block-MaxMaxScore (LBMM)。在這個(gè)框架下查詢處理主要分為三步：1).選擇候選文檔。本文提出的算法利用塊局部最大分?jǐn)?shù)而不是詞的全局最大分?jǐn)?shù)來選擇候選文檔，大大減少了選擇候選文檔的數(shù)量。2).過濾文檔。利用文檔所在塊最大分?jǐn)?shù)來過濾候選文檔。本文提出一種更高效的過濾方法，它能充分的利用查詢處理局部性，提升過濾的效率。3).候選文檔打分。本文提出的算法利用塊最大分?jǐn)?shù)來動(dòng)態(tài)估算文檔分?jǐn)?shù)上限，從而可以盡快的結(jié)束對不重要的候選文檔的打分。實(shí)驗(yàn)表明，新提出剪枝算法在查詢處理速度上比已有的算法都有明顯的提升，它比經(jīng)典的剪枝方法快近1倍。 3.本文提出了兩種在塊最大索引中存儲(chǔ)文檔靜態(tài)分?jǐn)?shù)與詞相關(guān)性分?jǐn)?shù)的方法。使得動(dòng)態(tài)索引剪枝算法的打分函數(shù)在包含文檔靜態(tài)分?jǐn)?shù)時(shí)也能對查詢進(jìn)行高效的剪枝。一種方法是對每個(gè)塊分別存儲(chǔ)塊中最大的詞相關(guān)性分?jǐn)?shù)和文檔靜態(tài)分?jǐn)?shù)；另一種方法是先把詞相關(guān)性分?jǐn)?shù)與文檔靜態(tài)分?jǐn)?shù)合并，然后存儲(chǔ)塊中最大合并分?jǐn)?shù)。第二種方法相對于第一種方法能更加精確地估算候選文檔的分?jǐn)?shù)，但它存在低估候選文檔分?jǐn)?shù)的情況。本文分析了產(chǎn)生文檔分?jǐn)?shù)低估的原因并給出了相應(yīng)的解決方法。同時(shí)，本文還討論了在不同文檔重排序下對動(dòng)態(tài)索引剪枝算法的影響。實(shí)驗(yàn)表明在打分公式中文檔靜態(tài)分?jǐn)?shù)比例較小時(shí)，按URL序重排序的索引查詢處理的速度比較快。當(dāng)文檔靜態(tài)分?jǐn)?shù)的比重增加時(shí)，則按文檔靜態(tài)分?jǐn)?shù)序排序的索引的查詢處理更有優(yōu)勢。 4.本文提出一種在區(qū)間最大索引中最優(yōu)化區(qū)間劃分的方法，并在這種區(qū)間劃分的方式下，提出了4種動(dòng)態(tài)索引剪枝算法：Space-Max WAND (SMW)，Space-Max MaxScore (SMM)，Space-Max AND (SMA)和Space Max AND withCheck (SMAC)。這種最優(yōu)化區(qū)間劃分的方法結(jié)合了文檔分?jǐn)?shù)估算誤差和區(qū)間自身的處理開銷等因素。它能在同等條件下對索引中的每個(gè)倒排鏈劃分出最優(yōu)的區(qū)間。利用區(qū)間最大的分?jǐn)?shù)，新提出剪枝算法能更加精確的估算出候選文檔的分?jǐn)?shù)，減少找到候選文檔的數(shù)量，同時(shí)對候選文檔打分也能更早的結(jié)束。實(shí)驗(yàn)表明，這種基于最優(yōu)化區(qū)間的方法的剪枝算法比根據(jù)倒排鏈長度來劃分區(qū)間的方法處理查詢的效率高。而基于區(qū)間最大索引的剪枝算法比基于塊最大索引的剪枝算法效率高。 5.在上述的研究成果的基礎(chǔ)上，重新設(shè)計(jì)并實(shí)現(xiàn)了天網(wǎng)搜索引擎。使得天網(wǎng)搜索引擎在索引量和查詢處理的效率都有非常大的提升。
[Abstract]:The search engine is being used by more and more people as the main entrance to network information. The increasing number of web pages and query requests make the search engine face huge performance challenges. Often, search engines need to manage thousands of queries per second on the hundreds of billions of web pages. So how efficient Query processing has always been an important research issue in search engine and information retrieval field.
This paper studies the method of improving the efficiency of query processing from the angle of index pruning. Index pruning is usually divided into static index pruning and dynamic index pruning. The static index pruning method is mainly used in the index construction phase. When the index is built, some unimportant information in the index are removed to shorten the length of the inverted chain length. Degree, reducing the size of the inverted index to improve the speed of the query. The method of dynamic index pruning is mainly used in the processing phase of the query. In the processing of the query, it uses the corresponding index auxiliary structure to jump some queries that are not important to the query to improve the query rate. This article cuts and moves from the static index respectively. State index pruning in two aspects to study the method of improving query processing, and put forward some new index structure and algorithm to support efficient query processing.
The innovation and main contributions of this work are mainly reflected in the following aspects:
1. this paper presents a static index pruning method for basic web page structure and importance. In this pruning method, it uses the structure of the web page to distinguish the importance of information in different structures and to retain information in important structure. On the other hand, it is important to make use of the importance of the web page to the web. The web page structure and importance information are very helpful to the pruning of the document. The static index pruning method combined with these two kinds of information has the same effect as the use of a complete index when removing 80% inverted items with the two information. The cost is reduced by 73% and the query processing speed is increased by 2 times.
2. this paper presents a framework for the processing of dynamic index pruning algorithms based on block maximum index. Under this framework, 3 new dynamic index pruning algorithms are proposed, which are Block-MaxMaxScore (BMM), Local Block-Max WAND (LBMW) and Local Block-MaxMaxScore (LBMM). In this framework, the query processing is divided into three steps. 1). Select candidate documents. The proposed algorithm uses a local maximum fraction of the block instead of the global maximum fraction of the word to select the candidate document, greatly reducing the number of selected candidate documents.2). Filter the document. Filter the candidate document using the maximum block score of the document. This paper proposes a more efficient filtering method, which can be fully used. Using the query processing locality to improve the efficiency of filtering.3). The candidate document is graded. The proposed algorithm uses the block maximum score to dynamically estimate the upper limit of the document fraction, so that the score of the unimportant candidate documents can be ended as soon as possible. The experiment shows that the new pruning algorithm is better than the existing algorithms in the query processing speed. Obviously, it is nearly 1 times faster than the classical pruning method.
3. this paper presents two methods to store the static fraction of the document and the correlation fraction of the word in the block maximum index. It makes the scoring function of the dynamic index pruning algorithm effectively prune the query when it contains the static fraction of the document. One method is to store the maximum word correlation score and the document static of each block respectively. The other method is to combine the word correlation fraction with the document static fraction first, and then store the maximum merge score in the block. The second method can estimate the score of the candidate document more accurately than the first method, but it underestimates the candidate document fraction. At the same time, this paper also discusses the influence of the dynamic index pruning algorithm under different reordering of the document. The experiment shows that the rate of the document static fraction in the scoring formula is small, and the speed of the index query processing according to the URL order reordering is faster. Query processing with index sorting is more advantageous.
4. this paper presents a method for the optimal interval division in the interval maximum index. In this way, 4 dynamic index pruning algorithms are proposed: Space-Max WAND (SMW), Space-Max MaxScore (SMM), Space-Max AND (SMA) and Space Max AND. The optimal interval of each inverted chain in the index can be divided under the same condition. Using the maximum fraction of the interval, the new pruning algorithm can estimate the score of the candidate document more accurately, reduce the number of candidate documents, and score the candidate documents at the same time. The experiment shows that the pruning algorithm based on the optimal interval method is more efficient than the method based on the inverted chain length to divide the interval, and the pruning algorithm based on the maximum index of the interval is more efficient than the pruning algorithm based on the largest index of the block.
5. on the basis of the above research results, the Skynet search engine is redesigned and implemented, which makes the search engine in the sky network greatly improve the efficiency of index and query processing.
【學(xué)位授予單位】：北京大學(xué)
【學(xué)位級別】：博士
【學(xué)位授予年份】：2013
【分類號(hào)】：TP391.3

【相似文獻(xiàn)】

相關(guān)期刊論文前10條

1 葉千軍;吳人珊;;檢索系統(tǒng)設(shè)計(jì)中若干重要選擇的研究(三)[J];大學(xué)圖書館學(xué)報(bào);1989年06期

2 藍(lán)海洋,周杰韓,張和明;文本索引詞項(xiàng)相對權(quán)重計(jì)算方法與應(yīng)用[J];計(jì)算機(jī)工程與應(yīng)用;2003年15期

3 陳莉;淺談古籍書目索引的編纂[J];圖書情報(bào)知識(shí);2005年03期

4 陳莉;韓錫鐸;;淺談古籍書目索引的編纂[J];中國索引;2004年04期

5 張新鳳;;SciFinder Scholar數(shù)據(jù)庫醫(yī)院圖書館學(xué)研究文獻(xiàn)內(nèi)容分析[J];醫(yī)學(xué)信息學(xué)雜志;2009年11期

6 胡小菁;情報(bào)檢索語言語法手段分析[J];上海第二工業(yè)大學(xué)學(xué)報(bào);1991年01期

7 子皿 ,邁至科;搜索引擎探密(上)[J];中國計(jì)算機(jī)用戶;1997年01期

8 陶躍華;趙波;楊秀國;;搜索引擎的文檔預(yù)處理技術(shù)研究[J];計(jì)算機(jī)科學(xué);2002年07期

9 陸小麗;何加銘;;基于Map/Reduce的索引數(shù)據(jù)云存儲(chǔ)模型研究[J];寧波大學(xué)學(xué)報(bào)(理工版);2011年03期

10 劉丹;利用《CA on CD》光盤數(shù)據(jù)庫查找信息資源[J];大學(xué)化學(xué);2001年04期

相關(guān)會(huì)議論文前10條

1 彭軻;廖聞劍;;淺析搜索引擎[A];中國通信學(xué)會(huì)第五屆學(xué)術(shù)年會(huì)論文集[C];2008年

2 李丹;;如何利用搜索引擎查找中醫(yī)藥信息[A];中國中醫(yī)藥信息研究會(huì)第二屆理事大會(huì)暨學(xué)術(shù)交流會(huì)議論文匯編[C];2003年

3 鄧長壽;郭景峰;楊焱林;鄧安遠(yuǎn);;下一代Web搜索引擎初探[A];第十八屆全國數(shù)據(jù)庫學(xué)術(shù)會(huì)議論文集（研究報(bào)告篇）[C];2001年

4 維尼拉·木沙江;吐爾洪·吾司曼;;維、哈、柯文搜索引擎中網(wǎng)頁爬行器的設(shè)計(jì)與實(shí)現(xiàn)[A];少數(shù)民族青年自然語言處理技術(shù)研究與進(jìn)展——第三屆全國少數(shù)民族青年自然語言信息處理、第二屆全國多語言知識(shí)庫建設(shè)聯(lián)合學(xué)術(shù)研討會(huì)論文集[C];2010年

5 孫斌;;使用內(nèi)存匯集的新聞搜索索引更新[A];第四屆全國信息檢索與內(nèi)容安全學(xué)術(shù)會(huì)議論文集（上）[C];2008年

6 湯薇;曾艷;;構(gòu)建校園網(wǎng)搜索引擎必要性分析[A];廣西計(jì)算機(jī)學(xué)會(huì)2008年年會(huì)論文集[C];2008年

7 姚樹宇;趙少東;;一種使用分布式技術(shù)的搜索引擎[A];2005年全國開放式分布與并行計(jì)算學(xué)術(shù)會(huì)議論文集[C];2005年

8 倪俊峰;;基于黃頁搜索引擎的關(guān)鍵字排名廣告系統(tǒng)的設(shè)計(jì)與實(shí)現(xiàn)[A];2005年中國索引學(xué)會(huì)年會(huì)暨學(xué)術(shù)研討會(huì)論文集[C];2005年

9 張怡;查貴庭;;SEO在信息服務(wù)中的應(yīng)用研究[A];2010年中國索引學(xué)會(huì)年會(huì)暨學(xué)術(shù)研討會(huì)論文集[C];2010年

10 陳援非;何哲;朱珍民;;基于普適計(jì)算的個(gè)性化搜索技術(shù)[A];第二屆和諧人機(jī)環(huán)境聯(lián)合學(xué)術(shù)會(huì)議(HHME2006)——第2屆中國普適計(jì)算學(xué)術(shù)會(huì)議(PCC'06)論文集[C];2006年

相關(guān)重要報(bào)紙文章前10條

1 李一鑫;搜索排名的紅與黑[N];財(cái)經(jīng)時(shí)報(bào);2007年

2 周文林;搜狗3.0能否撼動(dòng)搜索市場[N];經(jīng)濟(jì)參考報(bào);2007年

3 惠正一;比爾·蓋茨:微軟不怕Google[N];第一財(cái)經(jīng)日報(bào);2005年

4 賽迪顧問股份有限公司互聯(lián)網(wǎng)與電子商務(wù)咨詢中心常燕杰;搜索，還是門戶[N];中國計(jì)算機(jī)報(bào);2005年

5 陳珊;浙江移動(dòng)推出手機(jī)搜索引擎服務(wù)[N];人民郵電;2005年

6 趙法忠;搜索引擎還需悠著點(diǎn)[N];中國經(jīng)營報(bào);2005年

7 金朝力;搜索引擎火拼搜索質(zhì)量[N];北京商報(bào);2006年

8 本報(bào)記者　趙曉輝孟昭麗;搜索引擎駛?cè)搿氨茱L(fēng)港”[N];中國證券報(bào);2006年

9 孫t;搜索引擎驚喜侵權(quán)官司止于“避風(fēng)港”？[N];第一財(cái)經(jīng)日報(bào);2006年

10 姜蕊;問天下誰識(shí)搜索？[N];中國高新技術(shù)產(chǎn)業(yè)導(dǎo)報(bào);2006年

相關(guān)博士學(xué)位論文前10條

1 單棟棟;搜索引擎中索引剪枝的研究[D];北京大學(xué);2013年

2 岑榮偉;基于用戶行為分析的搜索引擎評價(jià)研究[D];清華大學(xué);2010年

3 李群;主題搜索引擎聚類算法的研究[D];北京林業(yè)大學(xué);2011年

4 蘇君華;面向搜索引擎的技術(shù)接受模型研究[D];南京大學(xué);2011年

5 劉佐達(dá);分布協(xié)作式搜索引擎模型及算法研究[D];清華大學(xué);2011年

6 陳旭毅;基于索引云的企業(yè)搜索引擎實(shí)現(xiàn)研究[D];武漢大學(xué);2011年

7 郭眈;中文互聯(lián)網(wǎng)視頻搜索引擎系統(tǒng)策略研究[D];北京交通大學(xué);2012年

8 王昤璞;基于用戶體驗(yàn)的互聯(lián)網(wǎng)搜索引擎醫(yī)學(xué)信息檢索可用性評估研究[D];吉林大學(xué);2010年

9 李莎莎;面向搜索引擎的自然語言處理關(guān)鍵技術(shù)研究[D];國防科學(xué)技術(shù)大學(xué);2011年

10 白玉琪;空間信息搜索引擎研究[D];中國科學(xué)院研究生院（遙感應(yīng)用研究所）;2003年

相關(guān)碩士學(xué)位論文前10條

1 邢科;基于GPU的索引構(gòu)建方法研究[D];華中科技大學(xué);2012年

2 薛云;Internet上元搜索引擎的研究與設(shè)計(jì)[D];太原理工大學(xué);2003年

3 王春花;基于Nutch的農(nóng)業(yè)搜索引擎檢索結(jié)果排序策略的研究[D];西北農(nóng)林科技大學(xué);2010年

4 李雷;基于Nutch的農(nóng)業(yè)信息搜索引擎實(shí)現(xiàn)和優(yōu)化[D];吉林大學(xué);2011年

5 董晨;基于模糊聚類的個(gè)性化搜索引擎的研究[D];福州大學(xué);2005年

6 封俊;基于Hadoop的分布式搜索引擎研究與實(shí)現(xiàn)[D];太原理工大學(xué);2010年

7 李浩;分布式教育網(wǎng)信息檢索系統(tǒng)的研究和實(shí)現(xiàn)[D];華南理工大學(xué);2010年

8 尉建興;基于Lucene搜索引擎的研究與應(yīng)用[D];太原理工大學(xué);2011年

9 李建平;智能化WEB信息搜索引擎的研究與實(shí)現(xiàn)[D];大慶石油學(xué)院;2003年

10 田生偉;基于涉農(nóng)詞典的搜索引擎的研究與實(shí)踐[D];新疆大學(xué);2004年

本文編號(hào)：2145740

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會(huì)員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2145740.html

論文發(fā)表

·知網(wǎng)|萬方|維普|龍?jiān)磡省級|國家級|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

搜索引擎中索引剪枝的研究