搜索引擎中的索引壓縮和查詢問題研究

發(fā)布時間：2018-04-20 02:19

本文選題：倒排索引 + 索引壓縮��；參考：《國防科學(xué)技術(shù)大學(xué)》2015年博士論文

【摘要】：隨著互聯(lián)網(wǎng)技術(shù)的飛速發(fā)展和互聯(lián)網(wǎng)應(yīng)用的不斷普及,互聯(lián)網(wǎng)資源成為當(dāng)前規(guī)模最大、內(nèi)容最豐富、使用最廣泛的信息來源。為了有效地從這些海量數(shù)據(jù)中檢索到需要的信息,搜索引擎已經(jīng)成為向用戶提供快速資源定位的最好技術(shù)手段。然而,不斷增長的網(wǎng)頁規(guī)模和查詢請求使得搜索引擎面臨著巨大的性能挑戰(zhàn)。面對海量的網(wǎng)頁數(shù)據(jù)和巨大的查詢需求,如何高效地處理查詢請求成為搜索引擎領(lǐng)域中最重要的研究問題之一。在查詢性能上的任何提升都可以顯著地降低系統(tǒng)硬件資源的投入并提升用戶的查詢體驗,從而為搜索引擎的良好運行提供堅實的基礎(chǔ)。因此,本文主要研究提高搜索引擎效率的方法,并重點從倒排索引壓縮和查詢兩方面技術(shù)結(jié)合的角度出發(fā)來解決制約搜索引擎系統(tǒng)性能的問題。本文的主要研究成果如下:(1)為了提升搜索引擎系統(tǒng)索引數(shù)據(jù)的壓縮性能,本文研究了典型的Simple9索引壓縮算法。針對Simple9壓縮算法中存在的填充模式間可壓縮整數(shù)個數(shù)的稀疏問題,本文提出了密集填充技術(shù)(Dense Padding)來使一個32位機器字能夠壓縮更多的整數(shù),從而提升Simple9索引壓縮算法的壓縮率。當(dāng)某一填充模式中異常值的相對位置大于下一個填充模式的可壓縮整數(shù)個數(shù)時,我們通過在被壓縮整數(shù)序列末尾插入整數(shù)0來構(gòu)造滿足本填充模式的可壓縮整數(shù)序列。在解壓過程中,我們設(shè)計巧妙的算法來去除額外的整數(shù)0。此外,針對Simple9中可壓縮數(shù)字序列個數(shù)較少的導(dǎo)致的位操作過多問題,本文提出了分組Simple9壓縮技術(shù)(GroupSimple9)來降低壓縮和解壓縮過程中的數(shù)據(jù)讀取、分支判斷、移位和查找映射表等位操作,從而提升Simple9壓縮算法的壓縮和解壓速度。實驗結(jié)果表明,本文提出的分組Simple9和密集Simple9(DenseSimple9)壓縮算法能夠在壓縮率、壓縮和解壓三方面上均有明顯的性能提升。(2)為了提升搜索引擎系統(tǒng)索引數(shù)據(jù)的查詢性能,本文研究了當(dāng)前流行的MaxScore動態(tài)剪枝算法。由于當(dāng)前MaxScore查詢算法是在必要表(Essential Lists)中進行選擇候選文檔,非必要表(Non-essential Lists)的倒排項僅進行分?jǐn)?shù)貢獻累加。MaxScore查詢算法對必要表的訪問是順序進行的,而僅僅對非必要表采用隨機訪問,這種情況下MaxScore查詢算法存在著嚴(yán)重的候選文檔失效問題。針對這一問題,本文首先實現(xiàn)了一種多層自索引結(jié)構(gòu)(Hierarchical Self-index)來支持倒排鏈表的隨機訪問,然后提出對候選文檔最大可能分?jǐn)?shù)進行雙向檢查,實現(xiàn)了必要表的快速跳躍訪問(Essential Lists Skipping,ELS)。這樣實現(xiàn)必要表中候選文檔的預(yù)先檢測和隨機訪問,使得候選文檔的打分失效問題能夠盡早被發(fā)現(xiàn),避免在將要失效的候選文檔上的一些無用的操作。實驗結(jié)果表明,本文提出的ELS-MaxScore查詢算法能夠大大提升top-k查詢性能,尤其在查詢長度小于4時能達到MaxScore近2倍的性能。(3)通過之前的分析可以發(fā)現(xiàn)提升搜索引擎查詢性能的一個方法是,候選文檔的選擇應(yīng)該主要集中在重要度較高的詞項對應(yīng)的倒排鏈表中。在分析WAND查詢算法問題和研究激進式MaxScore(Aggressive MaxScore)優(yōu)勢的基礎(chǔ)上,為了更好的利用詞項重要度這一重要特性,本文提出了最大重要度優(yōu)先(Largest Score First,LSF)查詢算法。LSF查詢算法使得具有較高重要度的查詢詞項所指向的倒排鏈表能夠優(yōu)先得到處理。分析表明,LSF查詢算法能夠解決當(dāng)前(Document At A Time,DAAT)和(Term At A Time,TAAT)兩種窮盡遍歷算法中存在的候選文檔隨機選擇和內(nèi)存消耗問題。為了進一步支持高效的動態(tài)剪枝算法,本文利用LSF查詢的對于詞項重要度考慮的優(yōu)勢,提出了兩種精確的動態(tài)剪枝算法:基于LSF的去除查詢詞項技術(shù)(Term Omitting,LSF_TO)和基于LSF的文檔部分打分技術(shù)(Partial Scoring,LSF_PS)�；赥REC GOV2上的實驗結(jié)果表明,本文提出的兩種基于LSF索引遍歷的動態(tài)剪枝算法能夠達到比基于DAAT索引遍歷的WAND和MaxScore兩種動態(tài)剪枝算法更好的查詢性能。
[Abstract]:With the rapid development of Internet technology and the continuous popularization of Internet applications, Internet resources have become the largest, most abundant and most widely used information sources. In order to effectively retrieve the required information from these massive data, search engines have become the best technical means to provide users with rapid resource location. However, the growing web page size and query request make the search engine face huge performance challenges. How to efficiently handle query requests is one of the most important research issues in the search engine field. Any enhancement in query performance can be significantly reduced in the face of massive web data and huge query requirements. The input of the system hardware resources and the improvement of the user's query experience provide a solid foundation for the good operation of the search engine. Therefore, this paper mainly studies the methods to improve the efficiency of the search engine, and focuses on the problem of solving the performance of the search engine system from the angle of the combination of two aspects of inverted index compression and query. The main achievements of this paper are as follows: (1) in order to improve the compression performance of the index data of the search engine, a typical Simple9 index compression algorithm is studied in this paper. In this paper, a dense filling technique (Dense Padding) is proposed to make a 32 bit in order to solve the sparse problem of the number of compressible integers among the filling modes in the compression algorithm. The machine word can compress more integers and improve the compression rate of the Simple9 index compression algorithm. When the relative position of the exception value in a filling mode is larger than the number of compressible integers in the next filling mode, we construct the compressible integer sequence full of full full fill mode by inserting the integer 0 at the end of the compressed integer sequence. In the process of decompression, we design ingenious algorithms to remove additional integer 0.. In addition, we propose packet Simple9 compression (GroupSimple9) to reduce data reading, branch judgment, shift and lookup in the process of compression and decompression in order to reduce the number of bit operations caused by fewer compressible numeric sequences in Simple9. In order to improve the compression and decompression speed of the Simple9 compression algorithm, the compression and decompression speed of the Simple9 compression algorithm is improved by mapping the equal bit operation. The experimental results show that the proposed packet Simple9 and dense Simple9 (DenseSimple9) compression algorithm can significantly improve the performance of the compression rate, compression and decompression. (2) to improve the query of the index data of the search engine system Yes, this paper studies the popular MaxScore dynamic pruning algorithm. Since the current MaxScore query algorithm is selected in the necessary table (Essential Lists), the inverted item of the unnecessary table (Non-essential Lists) is only carried out by the fractional contribution accumulating.MaxScore query algorithm to the necessary table. In this case, the MaxScore query algorithm has a serious problem of candidate document failure. In this case, a multi-layer self indexing structure (Hierarchical Self-index) is first implemented to support the random access of the inverted list, and then the maximum possible fraction of the candidate documents is detected by two-way detection. The quick jump access (Essential Lists Skipping, ELS) of the necessary table is implemented. This enables the pre detection and random access of candidate documents in the necessary table so that the problem of the score failure of the candidate documents can be discovered as soon as possible and avoids some useless operations on the candidate documents to be invalid. The experimental results show that this article is proposed in this paper. The ELS-MaxScore query algorithm can greatly improve the performance of Top-k query, especially when the query length is less than 4. (3) one way to find the performance of the search engine query is that the selection of candidate documents should be mainly concentrated in the inverted chain corresponding to the more important words. On the basis of analyzing the problem of WAND query algorithm and studying the advantages of radical MaxScore (Aggressive MaxScore), in order to make better use of the important character of word item, this paper proposes the maximum important degree priority (Largest Score First, LSF) query algorithm, which makes the query words with higher importance point to point. The analysis shows that the LSF query algorithm can solve the problem of random selection and memory consumption of candidate documents in the two exhaustive traversal algorithms (Document At A Time, DAAT) and (Term At A Time, TAAT). In order to further support the efficient dynamic pruning algorithm, this paper uses a LSF query for the words. Two precise dynamic pruning algorithms: Term Omitting, LSF_TO and Partial Scoring (LSF_PS) based on LSF. Based on the experimental results on TREC GOV2, two kinds of dynamic pruning algorithms based on LSF index traversal are proposed in this paper. Query performance reach based on DAAT index traversal of the WAND and the MaxScore two kind of dynamic pruning algorithm is better.

【學(xué)位授予單位】：國防科學(xué)技術(shù)大學(xué)
【學(xué)位級別】：博士
【學(xué)位授予年份】：2015
【分類號】：TP391.3

【相似文獻】

相關(guān)博士學(xué)位論文前1條

1 姜琨;搜索引擎中的索引壓縮和查詢問題研究[D];國防科學(xué)技術(shù)大學(xué);2015年

，

本文編號：1775806

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1775806.html

上一篇：基于本體論相互屬性的Web資源元數(shù)據(jù)模型
下一篇：主題搜索引擎中網(wǎng)絡(luò)爬蟲的實現(xiàn)研究

論文發(fā)表

·知網(wǎng)|萬方|維普|龍源|省級|國家級|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

搜索引擎中的索引壓縮和查詢問題研究