基于MapReduce的空間數(shù)據(jù)RkNN算法研究
發(fā)布時間:2018-04-03 11:33
本文選題:RkNN查詢 切入點:MapReduce 出處:《大連理工大學》2013年碩士論文
【摘要】:近幾年,隨著移動互聯(lián)網(wǎng)技術(shù)和地理信息技術(shù)的發(fā)展,基于位置服務(wù)應(yīng)用逐漸興起,從而使得空間定位信息的數(shù)據(jù)量呈現(xiàn)以指數(shù)級增長。而在地理位置信息相關(guān)的空間數(shù)據(jù)查詢中,RkNN (Reverse k Nearest Neighbor,反最近鄰)查詢問題是通過返回所給查詢點周圍的對象中,以該查詢點作為其kNN (k Nearest Neighbor,最近鄰)的所有對象的集合,其在多種數(shù)據(jù)挖掘應(yīng)用中備受關(guān)注。同樣傳統(tǒng)的RkNN方法也已難以滿足迅猛增長的數(shù)據(jù)速度以及用戶們的大規(guī)模實時查詢要求。 本文將傳統(tǒng)的RkNN算法與MapReduce分布式框架相結(jié)合,分析如何解決大規(guī)?臻g數(shù)據(jù)的分布式RkNN查詢問題。MapReduce是2004年由谷歌公司提出的一個用來進行并行處理和生成大數(shù)據(jù)集的模型,而MapReduce框架作為分布式計算中的典型的離線計算框架,很難實現(xiàn)實時性計算效果。因此,本文采用了離線和在線相結(jié)合的系統(tǒng)模型,利用MapReduce框架離線完成倒排網(wǎng)格索引的創(chuàng)建和更新工作,同時結(jié)合在線計算方法返回RkNN查詢結(jié)果。文中首先提出基于大規(guī)?臻g數(shù)據(jù)集上的倒排網(wǎng)格索引的暴力RkNN查詢算法——Basic-MRkNN算法;接下來提出對此算法的優(yōu)化算法——延遲Lazy-MRkNN查詢算法和增量式Eager-MRkNN查詢算法。為了減少網(wǎng)絡(luò)和磁盤I/O開銷,在過濾過程中利用了一些剪枝規(guī)則來提高本文提出的分布式算法性能。此外,通過在32個節(jié)點的集群上對模擬數(shù)據(jù)集和真實數(shù)據(jù)集的大量實驗表明,本文提出的基本算法在與泰森多邊形分布式RkNN查詢算法相比時能提速50%以上。此外,兩種優(yōu)化算法均優(yōu)于基本算法,Eager-MRkNN算法比較適合處理密集型數(shù)據(jù),而Lazy-MRkNN算法比較適合處理稀疏型數(shù)據(jù)。
[Abstract]:In recent years, with the development of mobile Internet and geographic information technology, the application of location-based services is gradually rising, which makes the amount of spatial location information increase exponentially.In the spatial data query related to geographical location information, the problem of rkNN / inverse Nearest neighbor is to use the query point as the collection of all objects of its kNN Nearest neighbor by returning the object around the given query point.It has attracted much attention in many kinds of data mining applications.The traditional RkNN method is also difficult to meet the rapid growth of data speed and users of large-scale real-time query requirements.This paper combines the traditional RkNN algorithm with the MapReduce distributed framework, and analyzes how to solve the distributed RkNN query problem of large-scale spatial data. MapReduce is a model proposed by Google in 2004 for parallel processing and big data set generation.However, as a typical offline computing framework in distributed computing, MapReduce framework is difficult to achieve real-time computing effect.Therefore, an offline and on-line system model is adopted in this paper. The inverted grid index is created and updated offline using the MapReduce framework, and the RkNN query results are returned with the on-line calculation method.In this paper, we first propose a robust RkNN query algorithm based on inverted grid index on large spatial data sets, Basic-MRkNN algorithm, and then we propose an optimization algorithm for this algorithm, which is the delayed Lazy-MRkNN query algorithm and the incremental Eager-MRkNN query algorithm.In order to reduce the I / O overhead of the network and disk, some pruning rules are used to improve the performance of the distributed algorithm proposed in this paper.In addition, a large number of experiments on simulated data sets and real data sets on 32-node clusters show that the proposed algorithm can increase the speed by more than 50% compared with Tyson polygon distributed RkNN query algorithm.In addition, the two optimization algorithms are better than the basic algorithm, Eager-MRkNN algorithm is more suitable for processing intensive data, while the Lazy-MRkNN algorithm is more suitable for sparse data processing.
【學位授予單位】:大連理工大學
【學位級別】:碩士
【學位授予年份】:2013
【分類號】:TP311.13;TP338.8
【參考文獻】
相關(guān)期刊論文 前1條
1 過志峰,王宇翔,楊崇俊;空間數(shù)據(jù)索引與查詢技術(shù)研究及其應(yīng)用[J];計算機工程與應(yīng)用;2002年23期
相關(guān)博士學位論文 前1條
1 高云君;時空數(shù)據(jù)庫查詢處理關(guān)鍵技術(shù)研究[D];浙江大學;2008年
,本文編號:1705031
本文鏈接:http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1705031.html
最近更新
教材專著