基于Spark的子圖匹配算法研究與實(shí)現(xiàn)
發(fā)布時(shí)間:2018-01-20 02:20
本文關(guān)鍵詞: Spark 子圖匹配 圖挖掘 并行算法 出處:《北京交通大學(xué)》2017年碩士論文 論文類型:學(xué)位論文
【摘要】:圖作為一種由頂點(diǎn)和邊構(gòu)成的數(shù)據(jù)結(jié)構(gòu),能夠簡(jiǎn)潔有力的表達(dá)事物之間的聯(lián)系。隨著大數(shù)據(jù)時(shí)代的到來(lái),數(shù)據(jù)的規(guī)模以前所未有的速度增長(zhǎng)著,Facebook、Twitter、微博等社交媒體每天都產(chǎn)生大量的社交圖數(shù)據(jù)。如何處理如此大規(guī)模的圖數(shù)據(jù)成為目前研究的熱點(diǎn)。其中,子圖匹配問(wèn)題又是圖數(shù)據(jù)處理領(lǐng)域中最為重要的問(wèn)題,圖的搜索,模式匹配等算法都需要子圖匹配算法的支持。子圖匹配的數(shù)學(xué)基礎(chǔ)是圖論中的經(jīng)典問(wèn)題子圖同構(gòu),一個(gè)著名的NP完全問(wèn)題。目前,雖然有一些學(xué)者給出了一些方法來(lái)實(shí)現(xiàn)子圖匹配,但是這些方法只能處理小規(guī)模的圖數(shù)據(jù),在應(yīng)對(duì)如今大規(guī)模的數(shù)據(jù)時(shí),其匹配效率與可擴(kuò)展性都有很大不足。另外,多數(shù)子圖匹配算法應(yīng)用于無(wú)向圖,在有向圖的應(yīng)用上存在著不適用或效率低下的問(wèn)題。針對(duì)以上問(wèn)題。本文依托近些年來(lái)興起的大數(shù)據(jù)平臺(tái),利用其提供強(qiáng)大的存儲(chǔ)與計(jì)算能力,研究并實(shí)現(xiàn)了以大數(shù)據(jù)處理平臺(tái)Spark作為處理引擎,應(yīng)用GraphX組件處理超大規(guī)模圖數(shù)據(jù)的子圖匹配算法。該算法以HDFS為存儲(chǔ)平臺(tái),有效解決了集群負(fù)載均衡問(wèn)題;計(jì)算過(guò)程分為分布式過(guò)濾階段與分布式驗(yàn)證階段。分布式過(guò)濾階段充分考慮Spark平臺(tái)特性與GraphX以頂點(diǎn)為分割的圖分割策略,提出頂點(diǎn)簽名數(shù)據(jù)結(jié)構(gòu),通過(guò)并行比對(duì)頂點(diǎn)簽名的方式實(shí)現(xiàn)對(duì)數(shù)據(jù)圖快速高效過(guò)濾。其中,頂點(diǎn)簽名中表達(dá)了自身與鄰域信息,鄰域中又區(qū)分父鄰域與子鄰域,提升了對(duì)有向圖的過(guò)濾效果。分布式驗(yàn)證階段利用Spark平臺(tái)分布式計(jì)算優(yōu)勢(shì),提出候選子圖匹配區(qū)域概念,通過(guò)對(duì)查詢圖中心點(diǎn)的計(jì)算,在數(shù)據(jù)圖中得到多個(gè)與查詢圖規(guī)模相當(dāng)?shù)暮蜻x子圖匹配區(qū)域,將經(jīng)過(guò)過(guò)濾的超大規(guī)模圖數(shù)據(jù)進(jìn)一步進(jìn)行分割,在更小規(guī)模候選子圖匹配區(qū)域中執(zhí)行高效子圖匹配操作。最后,通過(guò)實(shí)驗(yàn)表明,本文分布式子圖過(guò)匹配算法具有很好的匹配效率與可擴(kuò)展能力,在與目前優(yōu)秀子圖匹配算法VF2的對(duì)比實(shí)驗(yàn)中,本文算法具有很好的執(zhí)行效率優(yōu)勢(shì)。
[Abstract]:As a data structure composed of vertices and edges, graphs can express the relationship between things simply and forcefully. With the arrival of big data era, the scale of data is growing at an unprecedented rate. Social media like Facebook Twitter and Weibo generate a lot of social graph data every day. The subgraph matching problem is also the most important problem in the field of graph data processing, graph search. The mathematical foundation of subgraph matching is the classical subgraph isomorphism in graph theory, a famous NP-complete problem. Although some scholars have given some methods to achieve subgraph matching, these methods can only deal with small-scale graph data, when dealing with today's large-scale data. In addition, most subgraph matching algorithms are applied to undirected graph. In the application of directed graphs there are problems of inapplicability or inefficiency. In view of the above problems this paper relies on the big data platform which has emerged in recent years and uses it to provide powerful storage and computing power. A subgraph matching algorithm using big data processing platform Spark as processing engine and using GraphX component to process very large scale graph data is studied and implemented. The algorithm uses HDFS as storage platform. It effectively solves the problem of cluster load balancing. The computing process is divided into distributed filtering phase and distributed verification phase. The distributed filtering phase takes full account of the characteristics of Spark platform and GraphX segmentation strategy based on vertex. The data structure of vertex signature is proposed, and the data graph is filtered quickly and efficiently by parallel vertex signature, in which the information of itself and neighborhood is expressed in the vertex signature, and the parent neighborhood and the sub-neighborhood are distinguished in the neighborhood. The filtering effect of directed graph is improved. In the stage of distributed verification, the concept of candidate subgraph matching region is put forward by using the advantage of distributed computing on Spark platform, and the center point of the query graph is calculated. A number of candidate subgraph matching regions are obtained in the data graph, and the filtered super-large scale graph data is further segmented. An efficient subgraph matching operation is performed in a smaller candidate subgraph matching region. Finally, experiments show that the distributed subgraph over-matching algorithm has good matching efficiency and extensibility. In comparison with the current excellent subgraph matching algorithm VF2, the proposed algorithm has a good performance efficiency.
【學(xué)位授予單位】:北京交通大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2017
【分類號(hào)】:O157.5
【參考文獻(xiàn)】
相關(guān)期刊論文 前3條
1 Bin Cui;Hong Mei;Beng Chin Ooi;;Big data: the driver for innovation in databases[J];National Science Review;2014年01期
2 潘巍;李戰(zhàn)懷;伍賽;陳群;;基于消息傳遞機(jī)制的MapReduce圖算法研究[J];計(jì)算機(jī)學(xué)報(bào);2011年10期
3 徐凱旋;魯?shù)婪?;擴(kuò)展子圖同構(gòu)問(wèn)題的優(yōu)化算法[J];計(jì)算機(jī)工程;2011年19期
相關(guān)碩士學(xué)位論文 前3條
1 李燕;基于單鄰域的子圖查詢算法研究[D];燕山大學(xué);2016年
2 焦曉翠;基于V-index的子圖查詢算法的研究與實(shí)現(xiàn)[D];東北大學(xué);2012年
3 黃博;圖數(shù)據(jù)庫(kù)中多子圖匹配查詢算法研究[D];復(fù)旦大學(xué);2012年
,本文編號(hào):1446453
本文鏈接:http://sikaile.net/shoufeilunwen/benkebiyelunwen/1446453.html
最近更新
教材專著