基于倒排索引的集合T覆蓋查詢算法研究
[Abstract]:T coverage is the basis of set similarity query. In recommendation system, document approximate repeat detection and deletion, automatic completion of web search, entity parsing, document clustering, plagiarism detection, data cleaning, social network mining, Gene sequence alignment has important applications in many fields. The efficiency of T-coverage query under the traditional index structure is low, and the T-coverage calculation based on bitmap index can be represented by simple and fast logic bit operation such as AND,OR, so the computation efficiency is high. The T-overlay query based on bitmap index usually uses compressed bitmap, but because of the complexity of update operation, compressed bitmap is not suitable for the occasion of frequent data update. In order to solve this problem, this paper first designs an efficient segmented bitmap index structure, SBII (Segmented Bitmap Inverted Indexes), SBII, which not only takes into account the storage space overhead of bitmaps, but also makes full use of the characteristics of high computing speed of bitmaps. It also avoids the problem of updating compressed bitmap. Secondly, the T-coverage query algorithm SBII_Looped. is implemented on the basis of SBII. Finally, the experiment on the data set shows that compared with the traditional T-overlay query algorithm, the efficiency of SBII_Looped is improved obviously. The traditional T-coverage query algorithm based on CPU platform has low efficiency and throughput. In order to solve this problem, this paper studies the parallel T-coverage query algorithm based on GPU. Firstly, an efficient GPU segmented inverted index structure (GSII (GPU Segmentation Inverted Indexes), GSII) is designed, which takes full account of the characteristics of GPU shared memory. The inverted index is hash segmented so that every segment can be accessed in shared memory. Secondly, on the basis of GSII, an efficient grouping parallel T-overlay query algorithm (GSPS,) is implemented, which can group queries and support both intra-query parallelism and inter-query parallelism, so this algorithm has high efficiency. Finally, experiments on real data sets show that the performance of this algorithm is improved by about 30% compared with the best GPU parallel T-overlay query algorithm.
【學(xué)位授予單位】:昆明理工大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2017
【分類號(hào)】:TP311.13;TP391.3
【參考文獻(xiàn)】
中國(guó)期刊全文數(shù)據(jù)庫(kù) 前4條
1 王梅;楊思簫;樂嘉錦;;列存儲(chǔ)數(shù)據(jù)庫(kù)中壓縮位圖索引技術(shù)[J];計(jì)算機(jī)工程;2012年18期
2 賈連印;奚建清;李孟娟;游進(jìn)國(guó);劉勇;苗德成;;Dtrie-allpair:高效的集合T-覆蓋連接算法[J];華南理工大學(xué)學(xué)報(bào)(自然科學(xué)版);2012年06期
3 潘磊;雷鈺麗;王崇駿;謝俊元;;基于權(quán)重的Jaccard相似度度量的實(shí)體識(shí)別方法[J];北京交通大學(xué)學(xué)報(bào);2009年06期
4 宋韶旭;李春平;;基于非對(duì)稱相似度的文本聚類方法[J];清華大學(xué)學(xué)報(bào)(自然科學(xué)版);2006年07期
中國(guó)博士學(xué)位論文全文數(shù)據(jù)庫(kù) 前2條
1 張帆;搜索引擎中索引表求交和提前停止技術(shù)優(yōu)化研究[D];南開大學(xué);2012年
2 賈連印;內(nèi)存數(shù)據(jù)庫(kù)中集合相似度及集合包含問題的研究[D];華南理工大學(xué);2012年
中國(guó)碩士學(xué)位論文全文數(shù)據(jù)庫(kù) 前7條
1 高鵬;推薦系統(tǒng)中信息相似度的研究及其應(yīng)用[D];上海交通大學(xué);2013年
2 吳斐;基于N-gram的程序代碼抄襲檢測(cè)方法研究[D];西南大學(xué);2012年
3 閆丹;基于基因序列比對(duì)的立體匹配算法[D];哈爾濱理工大學(xué);2012年
4 鄭寰;數(shù)據(jù)備份中基于相似性的重復(fù)數(shù)據(jù)刪除的研究[D];華中科技大學(xué);2012年
5 盧克;基于Wikipedia的社會(huì)網(wǎng)絡(luò)挖掘[D];哈爾濱工業(yè)大學(xué);2009年
6 劉華;Web信息集成中數(shù)據(jù)清洗的研究[D];武漢理工大學(xué);2007年
7 張長(zhǎng)利;網(wǎng)頁(yè)相似性算法的研究與實(shí)現(xiàn)[D];吉林大學(xué);2005年
,本文編號(hào):2434317
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2434317.html