基于倒排索引的集合T覆蓋查詢算法研究

發(fā)布時(shí)間：2019-03-04 13:47

【摘要】：T覆蓋是集合相似度查詢的基礎(chǔ),在推薦系統(tǒng)、文檔近似重復(fù)檢測和刪除、web搜索的自動(dòng)補(bǔ)全、實(shí)體解析、文檔聚類、抄襲檢測、數(shù)據(jù)清洗、社會(huì)網(wǎng)絡(luò)挖掘、基因序列比對(duì)等多個(gè)領(lǐng)域有重要的應(yīng)用。傳統(tǒng)的索引結(jié)構(gòu)下的T覆蓋查詢效率較低,而基于位圖索引的T覆蓋計(jì)算都可以用AND、OR等簡單快速的邏輯位運(yùn)算表示,因而計(jì)算效率較高。通�；谖粓D索引的T覆蓋查詢使用壓縮位圖,但壓縮位圖由于更新操作較為復(fù)雜,不適用于數(shù)據(jù)更新頻繁的場合。為了解決這個(gè)問題,本文首先設(shè)計(jì)一種高效的分段位圖索引結(jié)構(gòu) SBII(Segmented Bitmap Inverted Indexes),SBII 既考慮位圖存儲(chǔ)空間開銷的問題,又充分利用了位圖運(yùn)算速度快的特點(diǎn),并且避免了壓縮位圖更新的問題。其次,在SBII的基礎(chǔ)上實(shí)現(xiàn)了 T覆蓋查詢算法SBII_Looped。最后,在數(shù)據(jù)集上的實(shí)驗(yàn)表明:相比傳統(tǒng)的T覆蓋查詢算法,SBII_Looped在效率上有明顯提高。傳統(tǒng)的基于CPU平臺(tái)的T覆蓋查詢算法效率以及吞吐率較低,為解決這一問題,本文研究基于GPU的并行T覆蓋查詢算法。首先,設(shè)計(jì)一種高效的GPU分段倒排索引結(jié)構(gòu) GSII(GPU Segmentation Inverted Indexes),GSII 充分考慮 GPU共享內(nèi)存的特點(diǎn),對(duì)倒排索引進(jìn)行hash分段,使得對(duì)每一分段的訪問均能在共享內(nèi)存中進(jìn)行。其次,在GSII的基礎(chǔ)上,實(shí)現(xiàn)高效的分組并行T覆蓋查詢算法GSPS,該算法對(duì)查詢進(jìn)行分組,同時(shí)支持查詢內(nèi)并行和查詢間并行,因此該算法具有較高的效率。最后,在真實(shí)數(shù)據(jù)集上進(jìn)行的實(shí)驗(yàn)表明:該算法相比目前最好的GPU并行T覆蓋查詢算法,性能提升了 30%左右。
[Abstract]:T coverage is the basis of set similarity query. In recommendation system, document approximate repeat detection and deletion, automatic completion of web search, entity parsing, document clustering, plagiarism detection, data cleaning, social network mining, Gene sequence alignment has important applications in many fields. The efficiency of T-coverage query under the traditional index structure is low, and the T-coverage calculation based on bitmap index can be represented by simple and fast logic bit operation such as AND,OR, so the computation efficiency is high. The T-overlay query based on bitmap index usually uses compressed bitmap, but because of the complexity of update operation, compressed bitmap is not suitable for the occasion of frequent data update. In order to solve this problem, this paper first designs an efficient segmented bitmap index structure, SBII (Segmented Bitmap Inverted Indexes), SBII, which not only takes into account the storage space overhead of bitmaps, but also makes full use of the characteristics of high computing speed of bitmaps. It also avoids the problem of updating compressed bitmap. Secondly, the T-coverage query algorithm SBII_Looped. is implemented on the basis of SBII. Finally, the experiment on the data set shows that compared with the traditional T-overlay query algorithm, the efficiency of SBII_Looped is improved obviously. The traditional T-coverage query algorithm based on CPU platform has low efficiency and throughput. In order to solve this problem, this paper studies the parallel T-coverage query algorithm based on GPU. Firstly, an efficient GPU segmented inverted index structure (GSII (GPU Segmentation Inverted Indexes), GSII) is designed, which takes full account of the characteristics of GPU shared memory. The inverted index is hash segmented so that every segment can be accessed in shared memory. Secondly, on the basis of GSII, an efficient grouping parallel T-overlay query algorithm (GSPS,) is implemented, which can group queries and support both intra-query parallelism and inter-query parallelism, so this algorithm has high efficiency. Finally, experiments on real data sets show that the performance of this algorithm is improved by about 30% compared with the best GPU parallel T-overlay query algorithm.
【學(xué)位授予單位】：昆明理工大學(xué)
【學(xué)位級(jí)別】：碩士
【學(xué)位授予年份】：2017
【分類號(hào)】：TP311.13;TP391.3

【參考文獻(xiàn)】

中國期刊全文數(shù)據(jù)庫前4條

1 王梅;楊思簫;樂嘉錦;;列存儲(chǔ)數(shù)據(jù)庫中壓縮位圖索引技術(shù)[J];計(jì)算機(jī)工程;2012年18期

2 賈連印;奚建清;李孟娟;游進(jìn)國;劉勇;苗德成;;Dtrie-allpair:高效的集合T-覆蓋連接算法[J];華南理工大學(xué)學(xué)報(bào)(自然科學(xué)版);2012年06期

3 潘磊;雷鈺麗;王崇駿;謝俊元;;基于權(quán)重的Jaccard相似度度量的實(shí)體識(shí)別方法[J];北京交通大學(xué)學(xué)報(bào);2009年06期

4 宋韶旭;李春平;;基于非對(duì)稱相似度的文本聚類方法[J];清華大學(xué)學(xué)報(bào)(自然科學(xué)版);2006年07期

中國博士學(xué)位論文全文數(shù)據(jù)庫前2條

1 張帆;搜索引擎中索引表求交和提前停止技術(shù)優(yōu)化研究[D];南開大學(xué);2012年

2 賈連印;內(nèi)存數(shù)據(jù)庫中集合相似度及集合包含問題的研究[D];華南理工大學(xué);2012年

中國碩士學(xué)位論文全文數(shù)據(jù)庫前7條

1 高鵬;推薦系統(tǒng)中信息相似度的研究及其應(yīng)用[D];上海交通大學(xué);2013年

2 吳斐;基于N-gram的程序代碼抄襲檢測方法研究[D];西南大學(xué);2012年

3 閆丹;基于基因序列比對(duì)的立體匹配算法[D];哈爾濱理工大學(xué);2012年

4 鄭寰;數(shù)據(jù)備份中基于相似性的重復(fù)數(shù)據(jù)刪除的研究[D];華中科技大學(xué);2012年

5 盧克;基于Wikipedia的社會(huì)網(wǎng)絡(luò)挖掘[D];哈爾濱工業(yè)大學(xué);2009年

6 劉華;Web信息集成中數(shù)據(jù)清洗的研究[D];武漢理工大學(xué);2007年

7 張長利;網(wǎng)頁相似性算法的研究與實(shí)現(xiàn)[D];吉林大學(xué);2005年

，

本文編號(hào)：2434317

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會(huì)員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2434317.html

上一篇：基于Hadoop平臺(tái)的LDA短文本分類算法研究
下一篇：基于自適應(yīng)目標(biāo)圖像恢復(fù)的紅外弱小目標(biāo)檢測

論文發(fā)表

·知網(wǎng)|萬方|維普|龍?jiān)磡省級(jí)|國家級(jí)|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

基于倒排索引的集合T覆蓋查詢算法研究