天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

當(dāng)前位置:主頁(yè) > 科技論文 > 搜索引擎論文 >

搜索引擎中的索引壓縮和查詢問(wèn)題研究

發(fā)布時(shí)間:2018-04-20 02:19

  本文選題:倒排索引 + 索引壓縮。 參考:《國(guó)防科學(xué)技術(shù)大學(xué)》2015年博士論文


【摘要】:隨著互聯(lián)網(wǎng)技術(shù)的飛速發(fā)展和互聯(lián)網(wǎng)應(yīng)用的不斷普及,互聯(lián)網(wǎng)資源成為當(dāng)前規(guī)模最大、內(nèi)容最豐富、使用最廣泛的信息來(lái)源。為了有效地從這些海量數(shù)據(jù)中檢索到需要的信息,搜索引擎已經(jīng)成為向用戶提供快速資源定位的最好技術(shù)手段。然而,不斷增長(zhǎng)的網(wǎng)頁(yè)規(guī)模和查詢請(qǐng)求使得搜索引擎面臨著巨大的性能挑戰(zhàn)。面對(duì)海量的網(wǎng)頁(yè)數(shù)據(jù)和巨大的查詢需求,如何高效地處理查詢請(qǐng)求成為搜索引擎領(lǐng)域中最重要的研究問(wèn)題之一。在查詢性能上的任何提升都可以顯著地降低系統(tǒng)硬件資源的投入并提升用戶的查詢體驗(yàn),從而為搜索引擎的良好運(yùn)行提供堅(jiān)實(shí)的基礎(chǔ)。因此,本文主要研究提高搜索引擎效率的方法,并重點(diǎn)從倒排索引壓縮和查詢兩方面技術(shù)結(jié)合的角度出發(fā)來(lái)解決制約搜索引擎系統(tǒng)性能的問(wèn)題。本文的主要研究成果如下:(1)為了提升搜索引擎系統(tǒng)索引數(shù)據(jù)的壓縮性能,本文研究了典型的Simple9索引壓縮算法。針對(duì)Simple9壓縮算法中存在的填充模式間可壓縮整數(shù)個(gè)數(shù)的稀疏問(wèn)題,本文提出了密集填充技術(shù)(Dense Padding)來(lái)使一個(gè)32位機(jī)器字能夠壓縮更多的整數(shù),從而提升Simple9索引壓縮算法的壓縮率。當(dāng)某一填充模式中異常值的相對(duì)位置大于下一個(gè)填充模式的可壓縮整數(shù)個(gè)數(shù)時(shí),我們通過(guò)在被壓縮整數(shù)序列末尾插入整數(shù)0來(lái)構(gòu)造滿足本填充模式的可壓縮整數(shù)序列。在解壓過(guò)程中,我們?cè)O(shè)計(jì)巧妙的算法來(lái)去除額外的整數(shù)0。此外,針對(duì)Simple9中可壓縮數(shù)字序列個(gè)數(shù)較少的導(dǎo)致的位操作過(guò)多問(wèn)題,本文提出了分組Simple9壓縮技術(shù)(GroupSimple9)來(lái)降低壓縮和解壓縮過(guò)程中的數(shù)據(jù)讀取、分支判斷、移位和查找映射表等位操作,從而提升Simple9壓縮算法的壓縮和解壓速度。實(shí)驗(yàn)結(jié)果表明,本文提出的分組Simple9和密集Simple9(DenseSimple9)壓縮算法能夠在壓縮率、壓縮和解壓三方面上均有明顯的性能提升。(2)為了提升搜索引擎系統(tǒng)索引數(shù)據(jù)的查詢性能,本文研究了當(dāng)前流行的MaxScore動(dòng)態(tài)剪枝算法。由于當(dāng)前MaxScore查詢算法是在必要表(Essential Lists)中進(jìn)行選擇候選文檔,非必要表(Non-essential Lists)的倒排項(xiàng)僅進(jìn)行分?jǐn)?shù)貢獻(xiàn)累加。MaxScore查詢算法對(duì)必要表的訪問(wèn)是順序進(jìn)行的,而僅僅對(duì)非必要表采用隨機(jī)訪問(wèn),這種情況下MaxScore查詢算法存在著嚴(yán)重的候選文檔失效問(wèn)題。針對(duì)這一問(wèn)題,本文首先實(shí)現(xiàn)了一種多層自索引結(jié)構(gòu)(Hierarchical Self-index)來(lái)支持倒排鏈表的隨機(jī)訪問(wèn),然后提出對(duì)候選文檔最大可能分?jǐn)?shù)進(jìn)行雙向檢查,實(shí)現(xiàn)了必要表的快速跳躍訪問(wèn)(Essential Lists Skipping,ELS)。這樣實(shí)現(xiàn)必要表中候選文檔的預(yù)先檢測(cè)和隨機(jī)訪問(wèn),使得候選文檔的打分失效問(wèn)題能夠盡早被發(fā)現(xiàn),避免在將要失效的候選文檔上的一些無(wú)用的操作。實(shí)驗(yàn)結(jié)果表明,本文提出的ELS-MaxScore查詢算法能夠大大提升top-k查詢性能,尤其在查詢長(zhǎng)度小于4時(shí)能達(dá)到MaxScore近2倍的性能。(3)通過(guò)之前的分析可以發(fā)現(xiàn)提升搜索引擎查詢性能的一個(gè)方法是,候選文檔的選擇應(yīng)該主要集中在重要度較高的詞項(xiàng)對(duì)應(yīng)的倒排鏈表中。在分析WAND查詢算法問(wèn)題和研究激進(jìn)式MaxScore(Aggressive MaxScore)優(yōu)勢(shì)的基礎(chǔ)上,為了更好的利用詞項(xiàng)重要度這一重要特性,本文提出了最大重要度優(yōu)先(Largest Score First,LSF)查詢算法。LSF查詢算法使得具有較高重要度的查詢?cè)~項(xiàng)所指向的倒排鏈表能夠優(yōu)先得到處理。分析表明,LSF查詢算法能夠解決當(dāng)前(Document At A Time,DAAT)和(Term At A Time,TAAT)兩種窮盡遍歷算法中存在的候選文檔隨機(jī)選擇和內(nèi)存消耗問(wèn)題。為了進(jìn)一步支持高效的動(dòng)態(tài)剪枝算法,本文利用LSF查詢的對(duì)于詞項(xiàng)重要度考慮的優(yōu)勢(shì),提出了兩種精確的動(dòng)態(tài)剪枝算法:基于LSF的去除查詢?cè)~項(xiàng)技術(shù)(Term Omitting,LSF_TO)和基于LSF的文檔部分打分技術(shù)(Partial Scoring,LSF_PS)。基于TREC GOV2上的實(shí)驗(yàn)結(jié)果表明,本文提出的兩種基于LSF索引遍歷的動(dòng)態(tài)剪枝算法能夠達(dá)到比基于DAAT索引遍歷的WAND和MaxScore兩種動(dòng)態(tài)剪枝算法更好的查詢性能。
[Abstract]:With the rapid development of Internet technology and the continuous popularization of Internet applications, Internet resources have become the largest, most abundant and most widely used information sources. In order to effectively retrieve the required information from these massive data, search engines have become the best technical means to provide users with rapid resource location. However, the growing web page size and query request make the search engine face huge performance challenges. How to efficiently handle query requests is one of the most important research issues in the search engine field. Any enhancement in query performance can be significantly reduced in the face of massive web data and huge query requirements. The input of the system hardware resources and the improvement of the user's query experience provide a solid foundation for the good operation of the search engine. Therefore, this paper mainly studies the methods to improve the efficiency of the search engine, and focuses on the problem of solving the performance of the search engine system from the angle of the combination of two aspects of inverted index compression and query. The main achievements of this paper are as follows: (1) in order to improve the compression performance of the index data of the search engine, a typical Simple9 index compression algorithm is studied in this paper. In this paper, a dense filling technique (Dense Padding) is proposed to make a 32 bit in order to solve the sparse problem of the number of compressible integers among the filling modes in the compression algorithm. The machine word can compress more integers and improve the compression rate of the Simple9 index compression algorithm. When the relative position of the exception value in a filling mode is larger than the number of compressible integers in the next filling mode, we construct the compressible integer sequence full of full full fill mode by inserting the integer 0 at the end of the compressed integer sequence. In the process of decompression, we design ingenious algorithms to remove additional integer 0.. In addition, we propose packet Simple9 compression (GroupSimple9) to reduce data reading, branch judgment, shift and lookup in the process of compression and decompression in order to reduce the number of bit operations caused by fewer compressible numeric sequences in Simple9. In order to improve the compression and decompression speed of the Simple9 compression algorithm, the compression and decompression speed of the Simple9 compression algorithm is improved by mapping the equal bit operation. The experimental results show that the proposed packet Simple9 and dense Simple9 (DenseSimple9) compression algorithm can significantly improve the performance of the compression rate, compression and decompression. (2) to improve the query of the index data of the search engine system Yes, this paper studies the popular MaxScore dynamic pruning algorithm. Since the current MaxScore query algorithm is selected in the necessary table (Essential Lists), the inverted item of the unnecessary table (Non-essential Lists) is only carried out by the fractional contribution accumulating.MaxScore query algorithm to the necessary table. In this case, the MaxScore query algorithm has a serious problem of candidate document failure. In this case, a multi-layer self indexing structure (Hierarchical Self-index) is first implemented to support the random access of the inverted list, and then the maximum possible fraction of the candidate documents is detected by two-way detection. The quick jump access (Essential Lists Skipping, ELS) of the necessary table is implemented. This enables the pre detection and random access of candidate documents in the necessary table so that the problem of the score failure of the candidate documents can be discovered as soon as possible and avoids some useless operations on the candidate documents to be invalid. The experimental results show that this article is proposed in this paper. The ELS-MaxScore query algorithm can greatly improve the performance of Top-k query, especially when the query length is less than 4. (3) one way to find the performance of the search engine query is that the selection of candidate documents should be mainly concentrated in the inverted chain corresponding to the more important words. On the basis of analyzing the problem of WAND query algorithm and studying the advantages of radical MaxScore (Aggressive MaxScore), in order to make better use of the important character of word item, this paper proposes the maximum important degree priority (Largest Score First, LSF) query algorithm, which makes the query words with higher importance point to point. The analysis shows that the LSF query algorithm can solve the problem of random selection and memory consumption of candidate documents in the two exhaustive traversal algorithms (Document At A Time, DAAT) and (Term At A Time, TAAT). In order to further support the efficient dynamic pruning algorithm, this paper uses a LSF query for the words. Two precise dynamic pruning algorithms: Term Omitting, LSF_TO and Partial Scoring (LSF_PS) based on LSF. Based on the experimental results on TREC GOV2, two kinds of dynamic pruning algorithms based on LSF index traversal are proposed in this paper. Query performance reach based on DAAT index traversal of the WAND and the MaxScore two kind of dynamic pruning algorithm is better.

【學(xué)位授予單位】:國(guó)防科學(xué)技術(shù)大學(xué)
【學(xué)位級(jí)別】:博士
【學(xué)位授予年份】:2015
【分類號(hào)】:TP391.3

【相似文獻(xiàn)】

相關(guān)博士學(xué)位論文 前1條

1 姜琨;搜索引擎中的索引壓縮和查詢問(wèn)題研究[D];國(guó)防科學(xué)技術(shù)大學(xué);2015年

,

本文編號(hào):1775806

資料下載
論文發(fā)表

本文鏈接:http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1775806.html


Copyright(c)文論論文網(wǎng)All Rights Reserved | 網(wǎng)站地圖 |

版權(quán)申明:資料由用戶32d7c***提供,本站僅收錄摘要或目錄,作者需要?jiǎng)h除請(qǐng)E-mail郵箱bigeng88@qq.com