基于多核的細粒度并行的集合相似連接

發(fā)布時間：2018-05-05 08:47

本文選題：相似連接 + 并行　；參考：《計算機學報》2017年10期

【摘要】：相似連接是指在給定的兩個數(shù)據(jù)集中,根據(jù)給定的相似性度量函數(shù)來計算數(shù)據(jù)之間的相似度,并找出所有相似度不小于給定閾值的數(shù)據(jù)對的操作.相似連接作為一個基本的操作,被廣泛地應用于各種領域.隨著網(wǎng)絡和移動應用等信息技術的不斷發(fā)展,數(shù)據(jù)呈現(xiàn)爆炸式增長,海量數(shù)據(jù)的分析需要強大的計算能力.為了滿足不斷增長的計算需求,提高計算效率和計算性能,計算機的體系結構也不斷升級,出現(xiàn)了多核多處理器架構.為了提高相似連接的效率和計算資源的利用率,文中提出了基于多核的并行相似連接方法.相似連接操作與傳統(tǒng)關系數(shù)據(jù)庫中結構化數(shù)據(jù)之間的連接操作不同,相似連接處理的是異構數(shù)據(jù),每一條數(shù)據(jù)處理的代價與其結構有關.為了實現(xiàn)多核處理器之間的任務均衡,文中提出了適合相似連接操作特點的數(shù)據(jù)長度均衡的數(shù)據(jù)劃分方法.根據(jù)相似連接操作遵循Filter-Refine兩階段框架的特點,結合現(xiàn)代計算機體系結構的多核特性,提出了基于共享索引的任務分解方法和基于獨立索引的任務分解方法.文中利用提出的數(shù)據(jù)劃分方法和任務分解方法,實現(xiàn)了基于多核的并行化相似連接算法,包括自連接和R-S連接.文中對兩種不同的實現(xiàn)方式的時間代價進行了分析,其中包括索引更新、索引掃描以及集合交運算的代價,從理論分析的角度證明了數(shù)據(jù)長度均衡劃分和獨立索引的實現(xiàn)方式在執(zhí)行效率上具有優(yōu)勢.文中實驗部分利用不同的數(shù)據(jù)集在多核多處理器平臺上對并行相似連接的不同實現(xiàn)方式的執(zhí)行效率和可擴展性進行了驗證.實驗結果證明,文中提出的數(shù)據(jù)長度均衡的數(shù)據(jù)劃分方法以及基于獨立索引的任務分解方法可以有效地提高并行相似連接的執(zhí)行效率,在16核平臺上使用16個線程在DBLP數(shù)據(jù)集上執(zhí)行并行的相似自連接以及在CiteSeer和DBLP兩個數(shù)據(jù)集上執(zhí)行并行的R-S連接都可以在1秒內完成.
[Abstract]:Similarity join is an operation that calculates the similarity between data according to the given similarity measure function in two given data sets and finds out all the data pairs whose similarity is not less than a given threshold. As a basic operation, similar join is widely used in various fields. With the development of information technology, such as network and mobile application, the data is increasing explosively. The analysis of massive data needs powerful computing power. In order to meet the increasing demand of computing and improve the computing efficiency and performance, the architecture of computer has been continuously upgraded, and a multi-core and multi-processor architecture has emerged. In order to improve the efficiency of similar join and the utilization of computing resources, a parallel similar join method based on multi-core is proposed in this paper. The similar join operation is different from the connection operation between the structured data in the traditional relational database. The similar join deals with the heterogeneous data, and the cost of each data processing is related to its structure. In order to realize the task equalization between multi-core processors, a data partition method of data length equalization suitable for the characteristics of similar connection operation is proposed in this paper. According to the characteristics of the similar join operation following the Filter-Refine two-stage framework and the multi-core characteristics of modern computer architecture, a task decomposition method based on shared index and a task decomposition method based on independent index are proposed. Using the proposed data partitioning method and task decomposition method, a parallel parallel similar join algorithm based on multi-core is implemented, including self-join and R-S connection. In this paper, the time cost of two different implementations is analyzed, including the cost of index update, index scan and set intersection. From the point of view of theoretical analysis, it is proved that the implementation of data length equilibrium partition and independent index has advantages in execution efficiency. In the experiment part, we use different data sets to verify the efficiency and expansibility of parallel similar connection implementation on multi-core and multi-processor platform. Experimental results show that the proposed data length equalization method and task decomposition method based on independent index can effectively improve the efficiency of parallel similar join execution. Using 16 threads on the 16-core platform to perform parallel similar self-join on DBLP dataset and parallel R-S connection on CiteSeer and DBLP data sets can be completed in one second.
【作者單位】：天津工業(yè)大學計算機科學與軟件學院;
【基金】：國家自然科學基金(61402329,61373104) 國家留學基金委資助~~
【分類號】：TP311.13

【相似文獻】

相關期刊論文前10條

1 夏軍,楊學軍;基于數(shù)據(jù)空間融合的全局計算與數(shù)據(jù)劃分方法[J];軟件學報;2004年09期

2 賈婷;魏祖寬;唐曙光;金在弘;;一種面向并行空間查詢的數(shù)據(jù)劃分方法[J];計算機科學;2010年08期

3 董春麗;趙榮彩;杜澎;王崢;;基于線性不等式的數(shù)據(jù)劃分方法的優(yōu)化[J];計算機應用;2007年05期

4 楊小虎;王新宇;毛明;;基于數(shù)據(jù)劃分的分布式模型及其負載均衡算法[J];浙江大學學報(工學版);2008年04期

5 丁強,臧斌宇,朱傳琪;一種動態(tài)分布數(shù)組的數(shù)據(jù)劃分模式[J];計算機工程與設計;2005年05期

6 丁強 ,臧斌宇 ,朱傳琪;基于指針數(shù)組的數(shù)據(jù)劃分模式[J];計算機工程與應用;2005年27期

7 錢辰;竇萬峰;;面向離散點云并行插值數(shù)據(jù)劃分方法研究[J];南京師范大學學報(工程技術版);2013年02期

8 呂成;金登男;;基于關系的并行數(shù)據(jù)倉庫的數(shù)據(jù)劃分和操作[J];計算機應用研究;2006年08期

9 仲躋冬,李曉明,方濱興;HPF計算劃分的算法實現(xiàn)[J];計算機工程與科學;1997年02期

10 宋效東;竇萬峰;湯國安;江嶺;趙菁;趙明偉;;分布式并行地形分析中數(shù)據(jù)劃分機制研究[J];國防科技大學學報;2013年01期

相關會議論文前4條

1 楊東華;李建中;張文平;;基于數(shù)據(jù)網(wǎng)格環(huán)境的連接操作算法[A];第二十一屆中國數(shù)據(jù)庫學術會議論文集（研究報告篇）[C];2004年

2 董光波;吳寧生;高效;曾慶虎;楊進;溫京;;一種組件式多線程網(wǎng)絡應用架構的設計與實現(xiàn)[A];2009年中國智能自動化會議論文集（第六分冊）[中南大學學報（增刊）][C];2009年

3 肖靜靜;李雙峰;彭智勇;;用多線程方式優(yōu)化PostgreSQL的查詢處理[A];第二十屆全國數(shù)據(jù)庫學術會議論文集（技術報告篇）[C];2003年

4 高齊新;揚金柱;趙大哲;劉積仁;;基于多線程的三維醫(yī)學影像的重建[A];第十四屆全國圖象圖形學學術會議論文集[C];2008年

相關重要報紙文章前1條

1 武漢 Tianyi;創(chuàng)建簡單的多線程程序[N];電腦報;2001年

相關博士學位論文前4條

1 逄龍;多線程程序中關聯(lián)變量原子性驗證關鍵技術研究[D];哈爾濱工業(yè)大學;2015年

2 吳振東;并行程序中bug檢測技術研究[D];國防科學技術大學;2015年

3 趙榮彩;多線程低功耗編譯優(yōu)化技術研究[D];中國科學院研究生院（計算技術研究所）;2002年

4 陳梅;面向復雜數(shù)據(jù)的聚類算法研究[D];蘭州大學;2016年

相關碩士學位論文前10條

1 朱振華;基于虛擬化部署的高能效數(shù)據(jù)庫集群設計[D];哈爾濱工業(yè)大學;2015年

2 王倩;大圖數(shù)據(jù)啟發(fā)式劃分與管理及在BC-BSP系統(tǒng)中的應用研究[D];東北大學;2014年

3 孫星宇;基于MapReduce的kNN-join算法的研究與設計[D];黑龍江大學;2016年

4 羅浩;分布式環(huán)境下Top-K計算問題研究[D];東南大學;2016年

5 郝峗;可擴展的面向關聯(lián)的流式圖數(shù)據(jù)劃分方法研究[D];華中科技大學;2015年

6 錢辰;面向DEM點云數(shù)據(jù)的并行插值數(shù)據(jù)劃分優(yōu)化方法研究[D];南京師范大學;2013年

7 高峰;基于BSP模型的大圖處理系統(tǒng)數(shù)據(jù)劃分模塊的設計與實現(xiàn)[D];東北大學;2012年

8 程佳;一種基于Hadoop的RDF數(shù)據(jù)劃分與存儲研究[D];南京大學;2013年

9 張雷;基于Spark系統(tǒng)的查詢分析及優(yōu)化研究[D];北京交通大學;2016年

10 李佳寧;GPU上基于Hadoop的高效連接操作算法研究[D];哈爾濱工業(yè)大學;2016年

，

本文編號：1847027

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1847027.html

上一篇：支持聯(lián)機分析處理的推特用戶興趣維層次提取方法
下一篇：石河子院校就業(yè)管理系統(tǒng)的設計與實現(xiàn)

論文發(fā)表

·知網(wǎng)|萬方|維普|龍源|省級|國家級|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

基于多核的細粒度并行的集合相似連接