天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

當(dāng)前位置:主頁 > 科技論文 > 搜索引擎論文 >

基于分區(qū)的倒排索引壓縮算法研究

發(fā)布時間:2018-11-29 09:38
【摘要】:倒排索引是目前應(yīng)用最為廣泛的全文索引技術(shù),是現(xiàn)代搜索引擎的核心技術(shù),F(xiàn)在互聯(lián)網(wǎng)上文本數(shù)據(jù)呈現(xiàn)爆炸式增長,為這些文本數(shù)據(jù)構(gòu)造的倒排索引也需要越來越多的存儲空間,壓縮倒排索引不僅可以節(jié)省磁盤空間,同時也可以減少查詢時所需要的磁盤和內(nèi)存訪問次數(shù)。因此,倒排索引壓縮的研究對于海量文本數(shù)據(jù)的全文檢索有很重要的意義。壓縮倒排索引主要是壓縮倒排索引中的docID序列、frequency序列,這兩類序列的壓縮最終可以轉(zhuǎn)化為對非負(fù)單調(diào)遞增整數(shù)序列的壓縮。基于分區(qū)的Elias-Fano(PEF)編碼壓縮算法提出分區(qū)二級數(shù)據(jù)結(jié)構(gòu),對整數(shù)序列進(jìn)行分區(qū)塊壓縮,顯示出良好的壓縮性能,本文針對整數(shù)序列"分區(qū)"壓縮的思想深入研究,主要工作如下:(1)證明了 Golomb-Rice編碼的壓縮性能優(yōu)于Elias-Fano編碼的壓縮性能,并且對于給定n個元素、上界為u的非負(fù)單調(diào)遞增整數(shù)序列,本文提出一種為該序列直接確定Golomb-Rice編碼參數(shù)b的方法。(2)結(jié)合Golomb-Rice編碼優(yōu)秀的壓縮性能和整數(shù)序列"分區(qū)"壓縮的思想,提出了基于分區(qū)的Elias-Fano-Golomb-Rice(PEFGR)編碼壓縮算法和基于分區(qū)的Golomb-Rcie(PGR)編碼壓縮算法。PEFGR編碼壓縮算法將整數(shù)序列劃分為若干區(qū)塊,構(gòu)造一種二級數(shù)據(jù)結(jié)構(gòu):第一級結(jié)構(gòu)存儲整數(shù)序列每個區(qū)塊邊界元素組成序列的Elias-Fano編碼結(jié)果;第二級結(jié)構(gòu)存儲整數(shù)序列每個區(qū)塊內(nèi)部元素的Golomb-Rice編碼結(jié)果。PGR編碼壓縮算法在PEFGR編碼壓縮算法的基礎(chǔ)上,將其第一級結(jié)構(gòu)存儲的元素也改用壓縮性能更好的Golomb-Rice編碼,進(jìn)一步提高算法的壓縮性能。(3)實現(xiàn)了本文提出的PEFGR編碼壓縮算法和PGR編碼壓縮算法,同時實現(xiàn)了 Golomb-Rice編碼壓縮算法、Elias-Fano編碼壓縮算法、PEF編碼壓縮算法、Simple-16編碼壓縮算法、OptPFD編碼壓縮算法用于對比。實驗數(shù)據(jù)集采用的是Clueweb12和wikipediaXML。實驗結(jié)果驗證了 Golomb-Rice編碼的壓縮性能優(yōu)于Elias-Fano編碼的壓縮性能;實驗結(jié)果表明出這些壓縮算法中PGR編碼壓縮算法的壓縮性能最優(yōu),PEFGR編碼壓縮算法次之。在docID方面,PGR編碼壓縮算法比PEF編碼壓縮算法平均每個整數(shù)至少節(jié)省0.3個bit,比Simple-16編碼壓縮算法、OptPFD編碼壓縮算法節(jié)省更多的bits。
[Abstract]:Inverted index is the most widely used full-text index technology and the core technology of modern search engine. Now text data on the Internet is growing explosively, and the inverted index for these text data also needs more and more storage space. Compression of inverted index can not only save disk space, It can also reduce the number of disk and memory access required when querying. Therefore, the research of inverted index compression is very important for full-text retrieval of massive text data. Compressed inverted index is mainly composed of docID sequence and frequency sequence in inverted index. The compression of these two kinds of sequences can be transformed into compression of non-negative monotone increasing integer sequence. Based on the partition Elias-Fano (PEF) coding compression algorithm, the partitioned two-level data structure is proposed, and the integer sequence is compressed into blocks, which shows good compression performance. In this paper, the idea of "partitioning" compression of integer sequences is deeply studied. The main works are as follows: (1) it is proved that the compression performance of Golomb-Rice coding is better than that of Elias-Fano coding, and for a given n elements, the upper bound is u, which is a nonnegative monotone increasing integer sequence. In this paper, a method for determining the Golomb-Rice coding parameter b for the sequence is proposed. (2) combining the excellent compression performance of Golomb-Rice coding and the idea of integer sequence "partition" compression, The Elias-Fano-Golomb-Rice (PEFGR) coding compression algorithm based on partition and Golomb-Rcie (PGR) coding compression algorithm based on partition are proposed. The integer sequence is divided into several blocks by PEFGR coding compression algorithm. A secondary data structure is constructed: the first level structure stores the Elias-Fano coding result of each block boundary element composition sequence of integer sequence; The second level structure stores the Golomb-Rice coding result of the element inside each block of integer sequence. Based on the PEFGR coding compression algorithm, the PGR coding algorithm also converts the elements stored in the first level structure to Golomb-Rice coding with better compression performance. Further improve the compression performance of the algorithm. (3) the implementation of the proposed PEFGR coding compression algorithm and PGR coding compression algorithm, at the same time the implementation of Golomb-Rice coding compression algorithm, Elias-Fano coding compression algorithm, PEF coding compression algorithm, Simple-16 coding compression algorithm, OptPFD coding compression algorithm for comparison. The experimental data set uses Clueweb12 and wikipediaXML. The experimental results show that the compression performance of Golomb-Rice coding is better than that of Elias-Fano coding, and the experimental results show that the compression performance of PGR coding algorithm is the best, followed by PEFGR coding compression algorithm. In terms of docID, the PGR coding compression algorithm saves at least 0.3 bit, per integer on average than the PEF coding compression algorithm, and the OptPFD coding compression algorithm saves more bits. than the Simple-16 coding compression algorithm.
【學(xué)位授予單位】:北京交通大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2017
【分類號】:TP391.3

【參考文獻(xiàn)】

相關(guān)期刊論文 前2條

1 毛福林;瞿有利;;一種變長編碼壓縮倒排索引算法[J];山東大學(xué)學(xué)報(理學(xué)版);2014年12期

2 劉小珠;彭智勇;;全文索引技術(shù)時空效率分析[J];軟件學(xué)報;2009年07期

相關(guān)碩士學(xué)位論文 前3條

1 郭爭文;基于TermID序列排序的標(biāo)識符重分配的倒排索引壓縮研究[D];北京交通大學(xué);2016年

2 毛福林;倒排索引壓縮算法研究[D];北京交通大學(xué);2015年

3 劉小華;壓縮全文索引的研究[D];北京交通大學(xué);2014年



本文編號:2364643

資料下載
論文發(fā)表

本文鏈接:http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2364643.html


Copyright(c)文論論文網(wǎng)All Rights Reserved | 網(wǎng)站地圖 |

版權(quán)申明:資料由用戶50bae***提供,本站僅收錄摘要或目錄,作者需要刪除請E-mail郵箱bigeng88@qq.com