Research on a Communication Interface for Parallel Programming on GPU Clusters
Keywords: GPU cluster; parallel programming; cluster communication; global arrays. Source: Huazhong University of Science and Technology, 2012 master's thesis. Document type: degree thesis.
【Abstract】: The graphics processing unit (GPU) excels at large-scale, data-intensive and parallel workloads, and the CUDA general-purpose parallel architecture has made GPUs increasingly common in general-purpose computing. Thanks to their high performance-to-cost ratio, GPU clusters are used ever more widely in high-performance computing, yet GPU-cluster parallel programming has no standard communication model. The vast majority of cluster applications are implemented with CUDA+MPI, and both CUDA and MPI are hard to program: the programmer must understand the GPU hardware architecture and the MPI message-passing mechanism, and must explicitly control data transfers between host memory and device memory and between nodes. For programmers, GPU-cluster parallel programming therefore remains a complex problem.

The GPU-cluster communication interface CUDAGA combines GA, a shared-memory programming model over distributed memory, with the CUDA general-purpose parallel architecture. Using shared device memory, it realizes inter-node GPU-to-GPU data communication through a global shared address space, and it maintains data consistency, and thus the correctness of communicated data, through an internally transparent temporary global array on the CPU side paired with the global array on the GPU side. The interface also solves GPU device initialization in multi-process, multi-GPU environments, and it offers both a GPU-cluster information query interface and a graphical monitoring interface so that users can track device usage. In addition, CUDAGA optimizes the array operations of the GA library in both data transfer and compute kernels, and the accelerated library can be used directly. CUDAGA thus gives users a simple, convenient communication interface for GPU-cluster parallel programming that lowers programming difficulty while preserving communication performance, raising programmers' productivity when writing GPU-cluster applications.

To test CUDAGA, the Cannon parallel matrix-multiplication algorithm and the Jacobi iteration algorithm were implemented and run on a GPU cluster. The results, measured in both programming complexity and communication performance, show that for applications whose basic data structure is the array and which combine heavy inter-node communication with many data-access operations, code written with CUDAGA outperforms the CUDA+MPI implementation while being less than half as long, improving programming productivity.
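To make the CUDA+MPI burden described in the first paragraph concrete, here is a minimal C sketch of the staging a programmer must write by hand to move one GPU buffer between two nodes: a device-to-host copy, an explicit MPI message, and a host-to-device copy on the receiver. Only standard CUDA runtime and MPI calls are used; the function name exchange and its arguments are illustrative, not taken from the thesis.

    /* Hand-written GPU-to-GPU transfer between two MPI ranks: the explicit
       staging pattern (device -> host -> network -> host -> device) that
       CUDAGA is designed to hide behind a global shared address space.
       Error checking is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    static void exchange(float *d_buf, size_t n, int rank)
    {
        float *h_buf = (float *)malloc(n * sizeof(float));  /* host staging buffer */
        if (rank == 0) {
            /* Sender: copy device memory up to the host, then ship it to rank 1. */
            cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver: take the message, then copy it back down to the device. */
            MPI_Recv(h_buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        }
        free(h_buf);
    }

Every such transfer in a CUDA+MPI application needs a matched send/receive pair and explicit host staging, which is exactly the bookkeeping the abstract identifies as the source of programming difficulty.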
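CUDAGA itself is not publicly documented, so no attempt is made to reproduce its API here. What can be sketched is the one-sided Global Arrays (GA) model that, per the abstract, CUDAGA extends to device memory. The calls below are GA's real C API; the comments about where CUDAGA would differ are an assumption based on the abstract, not documented behavior.

    /* The distributed-shared-memory style of the GA library that CUDAGA
       builds on: every process sees one logical 2-D array and addresses any
       block of it by global indices, with no matched send/receive pairs. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);  /* memory allocator GA draws from */

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};                           /* let GA pick the blocking */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);  /* one global 2-D array */

        int lo[2], hi[2];
        NGA_Distribution(g_a, GA_Nodeid(), lo, hi);  /* block owned by this process */

        /* One-sided NGA_Put/NGA_Get on [lo, hi] index ranges replaces explicit
           message passing; per the abstract, CUDAGA keeps this style while the
           array data resides in GPU device memory, staged through a transparent
           CPU-side temporary global array for consistency. */
        GA_Sync();  /* collective barrier: all outstanding transfers complete */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }

The contrast with the previous sketch is the point: the global address space removes both the matched send/receive pairs and the visible staging buffers from user code.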
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year awarded】: 2012
【CLC number】: TP338.6