當(dāng)前位置：主頁 > 碩博論文 > 基礎(chǔ)科學(xué)碩士論文 >

基于Spark的子圖匹配算法研究與實(shí)現(xiàn)

發(fā)布時(shí)間：2018-01-20 02:20

本文關(guān)鍵詞： Spark 子圖匹配圖挖掘并行算法　出處：《北京交通大學(xué)》2017年碩士論文　論文類型：學(xué)位論文

【摘要】：圖作為一種由頂點(diǎn)和邊構(gòu)成的數(shù)據(jù)結(jié)構(gòu),能夠簡潔有力的表達(dá)事物之間的聯(lián)系。隨著大數(shù)據(jù)時(shí)代的到來,數(shù)據(jù)的規(guī)模以前所未有的速度增長著,Facebook、Twitter、微博等社交媒體每天都產(chǎn)生大量的社交圖數(shù)據(jù)。如何處理如此大規(guī)模的圖數(shù)據(jù)成為目前研究的熱點(diǎn)。其中,子圖匹配問題又是圖數(shù)據(jù)處理領(lǐng)域中最為重要的問題,圖的搜索,模式匹配等算法都需要子圖匹配算法的支持。子圖匹配的數(shù)學(xué)基礎(chǔ)是圖論中的經(jīng)典問題子圖同構(gòu),一個(gè)著名的NP完全問題。目前,雖然有一些學(xué)者給出了一些方法來實(shí)現(xiàn)子圖匹配,但是這些方法只能處理小規(guī)模的圖數(shù)據(jù),在應(yīng)對(duì)如今大規(guī)模的數(shù)據(jù)時(shí),其匹配效率與可擴(kuò)展性都有很大不足。另外,多數(shù)子圖匹配算法應(yīng)用于無向圖,在有向圖的應(yīng)用上存在著不適用或效率低下的問題。針對(duì)以上問題。本文依托近些年來興起的大數(shù)據(jù)平臺(tái),利用其提供強(qiáng)大的存儲(chǔ)與計(jì)算能力,研究并實(shí)現(xiàn)了以大數(shù)據(jù)處理平臺(tái)Spark作為處理引擎,應(yīng)用GraphX組件處理超大規(guī)模圖數(shù)據(jù)的子圖匹配算法。該算法以HDFS為存儲(chǔ)平臺(tái),有效解決了集群負(fù)載均衡問題;計(jì)算過程分為分布式過濾階段與分布式驗(yàn)證階段。分布式過濾階段充分考慮Spark平臺(tái)特性與GraphX以頂點(diǎn)為分割的圖分割策略,提出頂點(diǎn)簽名數(shù)據(jù)結(jié)構(gòu),通過并行比對(duì)頂點(diǎn)簽名的方式實(shí)現(xiàn)對(duì)數(shù)據(jù)圖快速高效過濾。其中,頂點(diǎn)簽名中表達(dá)了自身與鄰域信息,鄰域中又區(qū)分父鄰域與子鄰域,提升了對(duì)有向圖的過濾效果。分布式驗(yàn)證階段利用Spark平臺(tái)分布式計(jì)算優(yōu)勢(shì),提出候選子圖匹配區(qū)域概念,通過對(duì)查詢圖中心點(diǎn)的計(jì)算,在數(shù)據(jù)圖中得到多個(gè)與查詢圖規(guī)模相當(dāng)?shù)暮蜻x子圖匹配區(qū)域,將經(jīng)過過濾的超大規(guī)模圖數(shù)據(jù)進(jìn)一步進(jìn)行分割,在更小規(guī)模候選子圖匹配區(qū)域中執(zhí)行高效子圖匹配操作。最后,通過實(shí)驗(yàn)表明,本文分布式子圖過匹配算法具有很好的匹配效率與可擴(kuò)展能力,在與目前優(yōu)秀子圖匹配算法VF2的對(duì)比實(shí)驗(yàn)中,本文算法具有很好的執(zhí)行效率優(yōu)勢(shì)。
[Abstract]:As a data structure composed of vertices and edges, graphs can express the relationship between things simply and forcefully. With the arrival of big data era, the scale of data is growing at an unprecedented rate. Social media like Facebook Twitter and Weibo generate a lot of social graph data every day. The subgraph matching problem is also the most important problem in the field of graph data processing, graph search. The mathematical foundation of subgraph matching is the classical subgraph isomorphism in graph theory, a famous NP-complete problem. Although some scholars have given some methods to achieve subgraph matching, these methods can only deal with small-scale graph data, when dealing with today's large-scale data. In addition, most subgraph matching algorithms are applied to undirected graph. In the application of directed graphs there are problems of inapplicability or inefficiency. In view of the above problems this paper relies on the big data platform which has emerged in recent years and uses it to provide powerful storage and computing power. A subgraph matching algorithm using big data processing platform Spark as processing engine and using GraphX component to process very large scale graph data is studied and implemented. The algorithm uses HDFS as storage platform. It effectively solves the problem of cluster load balancing. The computing process is divided into distributed filtering phase and distributed verification phase. The distributed filtering phase takes full account of the characteristics of Spark platform and GraphX segmentation strategy based on vertex. The data structure of vertex signature is proposed, and the data graph is filtered quickly and efficiently by parallel vertex signature, in which the information of itself and neighborhood is expressed in the vertex signature, and the parent neighborhood and the sub-neighborhood are distinguished in the neighborhood. The filtering effect of directed graph is improved. In the stage of distributed verification, the concept of candidate subgraph matching region is put forward by using the advantage of distributed computing on Spark platform, and the center point of the query graph is calculated. A number of candidate subgraph matching regions are obtained in the data graph, and the filtered super-large scale graph data is further segmented. An efficient subgraph matching operation is performed in a smaller candidate subgraph matching region. Finally, experiments show that the distributed subgraph over-matching algorithm has good matching efficiency and extensibility. In comparison with the current excellent subgraph matching algorithm VF2, the proposed algorithm has a good performance efficiency.
【學(xué)位授予單位】：北京交通大學(xué)
【學(xué)位級(jí)別】：碩士
【學(xué)位授予年份】：2017
【分類號(hào)】：O157.5

【參考文獻(xiàn)】

相關(guān)期刊論文前3條

1 Bin Cui;Hong Mei;Beng Chin Ooi;;Big data: the driver for innovation in databases[J];National Science Review;2014年01期

2 潘巍;李戰(zhàn)懷;伍賽;陳群;;基于消息傳遞機(jī)制的MapReduce圖算法研究[J];計(jì)算機(jī)學(xué)報(bào);2011年10期

3 徐凱旋;魯?shù)婪?;擴(kuò)展子圖同構(gòu)問題的優(yōu)化算法[J];計(jì)算機(jī)工程;2011年19期

相關(guān)碩士學(xué)位論文前3條

1 李燕;基于單鄰域的子圖查詢算法研究[D];燕山大學(xué);2016年

2 焦曉翠;基于V-index的子圖查詢算法的研究與實(shí)現(xiàn)[D];東北大學(xué);2012年

3 黃博;圖數(shù)據(jù)庫中多子圖匹配查詢算法研究[D];復(fù)旦大學(xué);2012年

，

本文編號(hào)：1446453

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會(huì)員下載

Download by Member

本文鏈接：http://sikaile.net/shoufeilunwen/benkebiyelunwen/1446453.html

上一篇：基于摻鉺光纖動(dòng)態(tài)光柵與光纖Bragg光柵F-P腔特性的研究
下一篇：泥河灣虎頭梁剖面孢粉、磁化率、粒度分析及其環(huán)境意義

論文發(fā)表

·知網(wǎng)|萬方|維普|龍?jiān)磡省級(jí)|國家級(jí)|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

基于Spark的子圖匹配算法研究與實(shí)現(xiàn)