基于BSP的大規(guī)模圖處理系統(tǒng)中通信和緩存技術(shù)研究
發(fā)布時(shí)間:2018-02-27 03:20
本文關(guān)鍵詞: BSP 圖處理 消息通信 磁盤緩存 云計(jì)算 出處:《東北大學(xué)》2012年碩士論文 論文類型:學(xué)位論文
【摘要】:隨著計(jì)算機(jī)以及網(wǎng)絡(luò)技術(shù)的發(fā)展,在計(jì)算機(jī)集群中采用并行的分布式計(jì)算方式提高計(jì)算處理能力已經(jīng)成為發(fā)展趨勢(shì)。云計(jì)算(Cloud Computing)的一個(gè)最主要的優(yōu)勢(shì)就是它的強(qiáng)大的并行計(jì)算處理能力,而這種能力是建立在一個(gè)簡(jiǎn)便高效的并行編程模型的基礎(chǔ)上的。其中,最有代表性的就是Google提出的MapReduce分布式并行編程模型。然而,隨著近年來(lái)互聯(lián)網(wǎng)應(yīng)用的迅猛發(fā)展,Web網(wǎng)絡(luò)、社交網(wǎng)絡(luò)等大規(guī)模網(wǎng)絡(luò)圖數(shù)據(jù)的分析處理成為了研究熱點(diǎn),例如社交網(wǎng)絡(luò)中的最短路徑、網(wǎng)頁(yè)搜索的PageRank等。這些圖處理問(wèn)題通常需要多次迭代,而MapReduce適合于通用的大數(shù)據(jù)集計(jì)算問(wèn)題,在處理具有多次迭代性質(zhì)的圖挖掘問(wèn)題時(shí)會(huì)導(dǎo)致次優(yōu)的性能。因而這些圖算法往往更適合于采用基于消息傳遞的并行模型來(lái)處理。BSP (Bulk Synchronous Parallel)整體同步并行模型就是一種支持消息傳遞的塊內(nèi)異步并行,塊間顯式同步的并行計(jì)算模型。隨著Google基于BSP模型實(shí)現(xiàn)的大規(guī)模圖處理系統(tǒng)Pregel的提出,在云環(huán)境中采用BSP模型實(shí)現(xiàn)大規(guī)模圖處理系統(tǒng)成為了主要的解決途徑。 本文旨在以BSP模型為核心,研究基于BSP模型的大規(guī)模圖處理系統(tǒng)中的消息通信原理和磁盤緩存技術(shù)的設(shè)計(jì)方案及其實(shí)現(xiàn)等問(wèn)題。提出了一種基于隊(duì)列的消息組織方式和通信方案,并在此基礎(chǔ)上提出了基于消息打包、多發(fā)送者線程池以及支持消息合并的優(yōu)化通信方案。針對(duì)基于BSP的大圖處理系統(tǒng)可能存在的內(nèi)存不足以存放計(jì)算中所有的圖和消息數(shù)據(jù)的問(wèn)題,本文建立了數(shù)據(jù)的內(nèi)存管理模型,并基于內(nèi)存優(yōu)先(Memory First)的思想,分別提出了圖數(shù)據(jù)和消息數(shù)據(jù)的磁盤緩存策略及相應(yīng)的算法:MF-GHIC算法、MLF圖數(shù)據(jù)遍歷算法和基于消息隊(duì)列優(yōu)先級(jí)的消息數(shù)據(jù)磁盤緩存算法等。 將本文提出的通信和緩存技術(shù)應(yīng)用于NEU-BSP系統(tǒng)中,我們通過(guò)實(shí)驗(yàn),首先分析了通信方案中各類參數(shù)的較優(yōu)值及其相互的制約關(guān)系;其次證明了在磁盤緩存率低于30%時(shí),系統(tǒng)的時(shí)間性能下降并不顯著;最后,我們以PageRank和單源最短路徑為例,通過(guò)與Hadoop系統(tǒng)的對(duì)比實(shí)驗(yàn),證明了在數(shù)據(jù)完全駐留內(nèi)存時(shí),NEU-BSP系統(tǒng)比Hadoop系統(tǒng)快1.2到18倍,在數(shù)據(jù)超過(guò)30%以上緩存到磁盤時(shí),NEU-BSP系統(tǒng)仍然能保持與Hadoop系統(tǒng)基本持平的時(shí)間性能。
[Abstract]:With the development of computer and network technology, the computer cluster used in parallel distributed computing capacity increase has become the development trend of computing mode. Cloud computing (Cloud Computing) is one of the main advantages is that its powerful parallel computing ability, and this ability is based on a simple and efficient parallel programming model on the basis of MapReduce. Among them, the most representative is the distributed parallel programming model proposed by Google. However, in recent years with the rapid development of Internet application, Web network, social network analysis and other large-scale network data has become a hot research topic, such as social networks in the shortest path, web search PageRank these problems. Graph processing usually requires multiple iterations, while the MapReduce big data set is suitable for general computation, with multiple iterations in processing The problem of qualitative graph mining will lead to suboptimal performance. These tend to be more suitable for the graph algorithm based on parallel model to deal with the.BSP message (Bulk Synchronous Parallel) bulk synchronous parallel model is in an asynchronous message passing parallel support, inter block explicit synchronization parallel computing model with Google. BSP model to realize large-scale graph processing system is proposed based on the Pregel, in a cloud environment using BSP model to realize large-scale graph processing system has become the main way.
The purpose of this paper is to use BSP model as the core, the research of BSP model in the large-scale system message communication principle and disk caching graph processing design and Realization Based on the presented. A news organization and communication scheme based on the queue, and on this basis put forward based on message package, optimized communication scheme for multi sender the thread pool and support message merge. In picture processing system of BSP are not enough memory to store all the maps and message data problems in calculation based on memory management model is established in this paper based on the data, and memory priority (Memory First) thoughts, respectively put forward disk caching strategy map data and message the data and the corresponding algorithm: MF-GHIC algorithm, MLF graph traversal algorithm and based on message queue priority message data disk cache algorithm.
The proposed communication and cache technology in NEU-BSP system, we first analyzed by experiments, the optimal value and the relation between the various parameters of mutual communication in the solution; it is proved that in the disk cache rate is less than 30%, the time performance of the system is not significantly decreased; at last, we use PageRank and single source shortest path as an example, through the contrast experiment with Hadoop system, proved that the complete data memory, NEU-BSP System 1.2 to 18 times faster than the Hadoop system, the data of more than 30% of the cache to disk, NEU-BSP system can still maintain the performance of time with basically the same as the Hadoop system.
【學(xué)位授予單位】:東北大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2012
【分類號(hào)】:TP338.6;TP333
【參考文獻(xiàn)】
相關(guān)期刊論文 前3條
1 宋青;汪小帆;;最短路徑算法加速技術(shù)研究綜述[J];電子科技大學(xué)學(xué)報(bào);2012年02期
2 李稚楹;楊武;謝治軍;;PageRank算法研究綜述[J];計(jì)算機(jī)科學(xué);2011年S1期
3 于戈;谷峪;鮑玉斌;王志剛;;云計(jì)算環(huán)境下的大規(guī)模圖數(shù)據(jù)處理技術(shù)[J];計(jì)算機(jī)學(xué)報(bào);2011年10期
,本文編號(hào):1540905
本文鏈接:http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1540905.html
最近更新
教材專著