Research on a Communication Interface for Parallel Programming on GPU Clusters
Keywords: GPU cluster; parallel programming; cluster communication; global arrays. Source: Huazhong University of Science and Technology, 2012 master's thesis. Document type: degree thesis.
【Abstract】: The graphics processing unit (GPU) excels at large-scale, data-intensive and parallel workloads, and the CUDA general-purpose parallel architecture has made GPUs increasingly common in general-purpose computing. Thanks to their high performance-to-cost ratio, GPU clusters are used ever more widely in high-performance computing, yet GPU-cluster parallel programming has no standard communication model. The vast majority of cluster applications are implemented with CUDA+MPI, and both CUDA and MPI are hard to program: the programmer must understand the GPU hardware architecture and the MPI message-passing mechanism, and must explicitly control data transfers between host memory and device memory and between nodes. For programmers, GPU-cluster parallel programming therefore remains a complex problem.

The GPU-cluster communication interface CUDAGA combines GA, a shared-memory programming model over distributed memory, with the CUDA general-purpose parallel architecture. Using shared device memory, it realizes inter-node GPU-to-GPU data communication through a global shared address space, and it maintains data consistency, and thus the correctness of communicated data, through an internally transparent temporary global array on the CPU side paired with the global array on the GPU side. The interface also solves GPU device initialization in multi-process, multi-GPU environments, and it offers both a GPU-cluster information query interface and a graphical monitoring interface so that users can track device usage. In addition, CUDAGA optimizes the array operations of the GA library in both data transfer and compute kernels, and the accelerated library can be used directly. CUDAGA thus gives users a simple, convenient communication interface for GPU-cluster parallel programming that lowers programming difficulty while preserving communication performance, raising programmers' productivity when writing GPU-cluster applications.

To test CUDAGA, the Cannon parallel matrix-multiplication algorithm and the Jacobi iteration algorithm were implemented and run on a GPU cluster. The results, measured in both programming complexity and communication performance, show that for applications whose basic data structure is the array and which combine heavy inter-node communication with many data-access operations, code written with CUDAGA outperforms the CUDA+MPI implementation while being less than half as long, improving programming productivity.
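To make the CUDA+MPI burden described in the first paragraph concrete, here is a minimal C sketch of the staging a programmer must write by hand to move one GPU buffer between two nodes: a device-to-host copy, an explicit MPI message, and a host-to-device copy on the receiver. Only standard CUDA runtime and MPI calls are used; the function name exchange and its arguments are illustrative, not taken from the thesis.

    /* Hand-written GPU-to-GPU transfer between two MPI ranks: the explicit
       staging pattern (device -> host -> network -> host -> device) that
       CUDAGA is designed to hide behind a global shared address space.
       Error checking is omitted for brevity. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    static void exchange(float *d_buf, size_t n, int rank)
    {
        float *h_buf = (float *)malloc(n * sizeof(float));  /* host staging buffer */
        if (rank == 0) {
            /* Sender: copy device memory up to the host, then ship it to rank 1. */
            cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver: take the message, then copy it back down to the device. */
            MPI_Recv(h_buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        }
        free(h_buf);
    }

Every such transfer in a CUDA+MPI application needs a matched send/receive pair and explicit host staging, which is exactly the bookkeeping the abstract identifies as the source of programming difficulty.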
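CUDAGA itself is not publicly documented, so no attempt is made to reproduce its API here. What can be sketched is the one-sided Global Arrays (GA) model that, per the abstract, CUDAGA extends to device memory. The calls below are GA's real C API; the comments about where CUDAGA would differ are an assumption based on the abstract, not documented behavior.

    /* The distributed-shared-memory style of the GA library that CUDAGA
       builds on: every process sees one logical 2-D array and addresses any
       block of it by global indices, with no matched send/receive pairs. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);  /* memory allocator GA draws from */

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};                           /* let GA pick the blocking */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);  /* one global 2-D array */

        int lo[2], hi[2];
        NGA_Distribution(g_a, GA_Nodeid(), lo, hi);  /* block owned by this process */

        /* One-sided NGA_Put/NGA_Get on [lo, hi] index ranges replaces explicit
           message passing; per the abstract, CUDAGA keeps this style while the
           array data resides in GPU device memory, staged through a transparent
           CPU-side temporary global array for consistency. */
        GA_Sync();  /* collective barrier: all outstanding transfers complete */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }

The contrast with the previous sketch is the point: the global address space removes both the matched send/receive pairs and the visible staging buffers from user code.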
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year awarded】: 2012
【CLC number】: TP338.6