面向圖數(shù)據(jù)的Top-k檢索算法研究
發(fā)布時間:2018-08-20 15:44
【摘要】:近年來隨著社交網(wǎng)絡、知識圖譜等應用的飛速發(fā)展,圖數(shù)據(jù)大量出現(xiàn)在學術界和工業(yè)界,如何有效管理圖數(shù)據(jù)并從中挖掘有價值的信息已經(jīng)成為當前數(shù)據(jù)管理領域的研究熱點。其中面向圖數(shù)據(jù)的Top-k檢索問題廣泛存在于在各類應用中,旨在從圖中找出與查詢最相關的k個匹配結果,是圖數(shù)據(jù)管理的一個重要研究問題。針對當前圖數(shù)據(jù)的數(shù)據(jù)規(guī)模大、節(jié)點和邊的種類多、內容動態(tài)變化等特點,本研究重點關注面向圖數(shù)據(jù)的Top-k檢索中三個關鍵問題:分布式環(huán)境的Top-k子圖檢索、多數(shù)據(jù)源的Top-k子圖檢索和動態(tài)數(shù)據(jù)的持續(xù)相關Top-k檢索,歸納總結已有工作的優(yōu)點和不足,提出相應改進算法。本研究的創(chuàng)新點主要有:·提出基于星形查詢拆分的分布式Top-k子圖檢索算法。算法基于星形子圖拆分策略,包含對原始查詢進行子查詢拆分、檢索子查詢匹配結果、合并子查詢匹配結果三個主要步驟。本研究基于算法停止條件的相關特點對算法子查詢執(zhí)行和合并子查詢匹配結果過程進行了三方面的優(yōu)化,進一步降低了算法的跨節(jié)點數(shù)據(jù)傳輸開銷。實驗結果表明:算法具有很好的可擴展性,優(yōu)化后算法執(zhí)行時間比已有方法平均低30%-40%。·提出面向多數(shù)據(jù)源Top-k子圖檢索的排序連接算法。算法包含構造來自不同數(shù)據(jù)源的匹配結果列表、連接結果列表并計算所得完整結果的分數(shù)和判斷算法停止條件是否滿足三個主要步驟。針對已有方法計算尚未訪問結果得分上界松弛的問題,本研究提出優(yōu)化的匹配結果列表排序函數(shù)和基于最小下降分數(shù)的結果列表讀取策略,并從理論上證明了排序函數(shù)最優(yōu)參數(shù)配置應該滿足的條件,以及優(yōu)化讀取策略產生的結果訪問順序最優(yōu)。實驗結果表明:本研究提出的排序連接算法比已有工作開銷低50%到70%!ぬ岢龌趧討B(tài)區(qū)間窗口劃分的持續(xù)相關Top-k檢索算法。算法基于從時間維度對問題空間進行單位時間區(qū)間劃分的思想,在每個區(qū)間執(zhí)行Top-k檢索,再歸并計算持續(xù)相關Top-k結果。針對單位區(qū)間劃分可擴展性差的問題,本研究進一步提出基于動態(tài)區(qū)間窗口劃分的算法,檢索過程中動態(tài)維護時間窗口,避免不必要區(qū)間劃分,減少了子問題的數(shù)量,同時合并不同區(qū)間重復的計算子過程,提升了算法的空間和時間效率。實驗結果表明:基于動態(tài)區(qū)間窗口劃分算法在消耗更少內存的情況下,執(zhí)行時間比已有方法低50%以上。
[Abstract]:In recent years, with the rapid development of social network, knowledge map and other applications, graph data appear in academia and industry. How to effectively manage graph data and mine valuable information has become a research hotspot in the field of data management. The problem of graph data oriented Top-k retrieval is widely used in all kinds of applications. It is an important research problem in graph data management to find k matching results that are most relevant to query from graph. In view of the characteristics of large scale of graph data, variety of nodes and edges and dynamic change of content, this paper focuses on three key problems in Top-k retrieval for graph data: Top-k subgraph retrieval in distributed environment. The Top-k subgraph retrieval of multiple data sources and the continuous correlated Top-k retrieval of dynamic data are summarized, and the advantages and disadvantages of existing work are summarized, and the corresponding improved algorithm is proposed. The main innovations of this study are as follows: a distributed Top-k subgraph retrieval algorithm based on star query splitting is proposed. The algorithm is based on star subgraph splitting strategy, which consists of three main steps: subquery splitting, subquery matching, merging subquery matching. Based on the characteristics of the algorithm stopping condition, this paper optimizes the performance of the algorithm subquery and the matching process of the merged sub-query, and further reduces the cross-node data transmission overhead of the algorithm. The experimental results show that the algorithm has good scalability and the execution time of the optimized algorithm is 30 to 40 less than that of the existing methods. A sorting join algorithm for multi-data source Top-k subgraph retrieval is proposed. The algorithm consists of constructing a list of matching results from different data sources, connecting the result list, calculating the score of the resulting complete result and judging whether the stopping condition of the algorithm meets the three main steps. To solve the problem of calculating the upper bound relaxation of unvisited results by existing methods, this paper proposes an optimized matching result list sorting function and a result list reading strategy based on the minimum descent score. It is proved theoretically that the optimal parameter configuration of the sort function should satisfy the conditions and the optimal access order of the results generated by the optimized reading strategy. The experimental results show that the proposed sorting connection algorithm is 50% to 70% lower than the existing overhead, and a continuous correlation Top-k retrieval algorithm based on dynamic interval window partition is proposed. Based on the idea of dividing the problem space into unit time intervals from the time dimension, the algorithm performs Top-k retrieval in each interval, and then merges and calculates the sustained correlation Top-k results. In order to solve the problem of poor scalability of unit interval partitioning, this paper proposes an algorithm based on dynamic interval window partitioning, which maintains the time window dynamically in the retrieval process, avoids unnecessary interval partitioning, and reduces the number of sub-problems. At the same time, the computation subprocesses with different interval repetition are combined to improve the space and time efficiency of the algorithm. The experimental results show that the execution time of the algorithm based on dynamic interval window partition is more than 50% lower than that of the existing methods under the condition of less memory consumption.
【學位授予單位】:清華大學
【學位級別】:博士
【學位授予年份】:2016
【分類號】:TP391.3
[Abstract]:In recent years, with the rapid development of social network, knowledge map and other applications, graph data appear in academia and industry. How to effectively manage graph data and mine valuable information has become a research hotspot in the field of data management. The problem of graph data oriented Top-k retrieval is widely used in all kinds of applications. It is an important research problem in graph data management to find k matching results that are most relevant to query from graph. In view of the characteristics of large scale of graph data, variety of nodes and edges and dynamic change of content, this paper focuses on three key problems in Top-k retrieval for graph data: Top-k subgraph retrieval in distributed environment. The Top-k subgraph retrieval of multiple data sources and the continuous correlated Top-k retrieval of dynamic data are summarized, and the advantages and disadvantages of existing work are summarized, and the corresponding improved algorithm is proposed. The main innovations of this study are as follows: a distributed Top-k subgraph retrieval algorithm based on star query splitting is proposed. The algorithm is based on star subgraph splitting strategy, which consists of three main steps: subquery splitting, subquery matching, merging subquery matching. Based on the characteristics of the algorithm stopping condition, this paper optimizes the performance of the algorithm subquery and the matching process of the merged sub-query, and further reduces the cross-node data transmission overhead of the algorithm. The experimental results show that the algorithm has good scalability and the execution time of the optimized algorithm is 30 to 40 less than that of the existing methods. A sorting join algorithm for multi-data source Top-k subgraph retrieval is proposed. The algorithm consists of constructing a list of matching results from different data sources, connecting the result list, calculating the score of the resulting complete result and judging whether the stopping condition of the algorithm meets the three main steps. To solve the problem of calculating the upper bound relaxation of unvisited results by existing methods, this paper proposes an optimized matching result list sorting function and a result list reading strategy based on the minimum descent score. It is proved theoretically that the optimal parameter configuration of the sort function should satisfy the conditions and the optimal access order of the results generated by the optimized reading strategy. The experimental results show that the proposed sorting connection algorithm is 50% to 70% lower than the existing overhead, and a continuous correlation Top-k retrieval algorithm based on dynamic interval window partition is proposed. Based on the idea of dividing the problem space into unit time intervals from the time dimension, the algorithm performs Top-k retrieval in each interval, and then merges and calculates the sustained correlation Top-k results. In order to solve the problem of poor scalability of unit interval partitioning, this paper proposes an algorithm based on dynamic interval window partitioning, which maintains the time window dynamically in the retrieval process, avoids unnecessary interval partitioning, and reduces the number of sub-problems. At the same time, the computation subprocesses with different interval repetition are combined to improve the space and time efficiency of the algorithm. The experimental results show that the execution time of the algorithm based on dynamic interval window partition is more than 50% lower than that of the existing methods under the condition of less memory consumption.
【學位授予單位】:清華大學
【學位級別】:博士
【學位授予年份】:2016
【分類號】:TP391.3
【相似文獻】
相關期刊論文 前10條
1 秦天增,王紅玉;二分檢索算法改進初探[J];運城高等?茖W校學報;2002年03期
2 姜敏;張曾科;董道毅;Tzyh-Jong Tarn;;量子二分檢索算法及其實現(xiàn)線路[J];光電子.激光;2007年08期
3 郭立新;吳,
本文編號:2194158
本文鏈接:http://sikaile.net/shoufeilunwen/xxkjbs/2194158.html
最近更新
教材專著