Design and Implementation of High-Performance DMA Transfer Modes for Scientific Computing on GPDSP

Published: 2018-03-11 08:17

Topic: scientific computing　Focus: GPDSP　Source: National University of Defense Technology, master's thesis, 2015　Type: degree thesis

[Abstract]: High-performance computing is a major challenge for computational science today, and it mainly involves algorithm research and processor design. The GPDSP is a high-performance multi-core microprocessor designed independently, from the ground up, at our university; it combines the strengths of general-purpose processors and digital signal processors. An analysis of the HPL high-performance benchmark shows that the main factor limiting HPL execution efficiency is the matrix update operation, which is implemented by calling general matrix multiply-accumulate (GEMM). GEMM can be implemented in many ways, and a large body of research indicates that the scheme based on the GEPP-GEBP approach achieves the highest execution efficiency on the GPDSP. Drawing on the architectural features of the GPDSP and an analysis of the GEMM implementation scheme, this thesis designs a set of special DMA transfer modes: DMA matrix transpose transfer, DMA segmented transfer, DMA inter-core synchronized transfer, and DMA blocking segmented transfer, which builds on DMA segmented transfer. DMA matrix transpose transfer moves a two-dimensional data block from the source memory space to the destination memory space and transposes the matrix during the move, which can greatly improve the efficiency of matrix multiplication. Tests with a simulation and verification tool show that the transfer efficiency of the DMA matrix transpose transfer designed here is at least 1.56 times that of the traditional matrix transpose transfer. The general idea of the GEMM implementation is to split the off-core data into many small blocks, send them to the local memories of multiple DSP cores for computation, and then move all the results back to off-core memory for synchronization. This thesis therefore designs DMA segmented transfer, DMA inter-core synchronized transfer, and DMA blocking segmented transfer. DMA segmented transfer quickly moves data from off-core memory into the local memories of multiple cores, DMA inter-core synchronized transfer quickly moves data from the local memories of multiple cores back off-core, and DMA blocking segmented transfer effectively hides the data movement time. Tests with Cadence's NC-Verilog simulation and verification tool show that DMA segmented transfer is at least 1.24 times as fast as the traditional transfer mode, that DMA blocking segmented transfer reduces the GEMM core computation time by at least 3000 cycles, and that the average speed of DMA inter-core synchronized transfer is 2.56 times that of the traditional transfer mode. After thorough verification and experimental testing, the special DMA transfer modes designed in this thesis meet the requirements of the algorithm and can effectively improve the execution efficiency of the HPL benchmark.
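The data flow described in the abstract can be illustrated with a small software model. The sketch below is not the thesis's hardware design: the function names (`dma_transpose_copy`, `segment_rows`, `gather_segments`, `gemm_on_cores`) are hypothetical, the matrices are plain Python lists, and the split is a simple row-wise partition rather than the GEPP-GEBP blocking used on the GPDSP. It only shows the three ideas in sequence: a block is transposed while it is copied, the input is split into per-core segments, and the per-core results are gathered back.

```python
def dma_transpose_copy(src, src_row, src_col, rows, cols):
    """Model of a transposing 2D block transfer (hypothetical helper).

    Reads a rows x cols block of src starting at (src_row, src_col) and
    returns it as a cols x rows block, transposed during the copy.
    """
    dst = [[0] * rows for _ in range(cols)]
    for r in range(rows):
        for c in range(cols):
            dst[c][r] = src[src_row + r][src_col + c]
    return dst


def segment_rows(matrix, num_cores):
    """Model of a segmented transfer: one contiguous row segment per core."""
    base, extra = divmod(len(matrix), num_cores)
    segments, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        segments.append(matrix[start:start + size])
        start += size
    return segments


def gather_segments(segments):
    """Model of the inter-core synchronized transfer: collect results in core order."""
    gathered = []
    for segment in segments:
        gathered.extend(segment)
    return gathered


def multiply_segment(a_rows, b_transposed):
    """Work done by one core: its rows of A times B, with B supplied transposed
    so that both operands of each dot product are read row-wise."""
    return [[sum(x * y for x, y in zip(row, col)) for col in b_transposed]
            for row in a_rows]


def gemm_on_cores(a, b, num_cores):
    """Compute A x B by transposing B on the way in, splitting A across cores,
    and gathering the partial results."""
    b_transposed = dma_transpose_copy(b, 0, 0, len(b), len(b[0]))
    partial = [multiply_segment(seg, b_transposed)
               for seg in segment_rows(a, num_cores)]
    return gather_segments(partial)


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    print(gemm_on_cores(a, b, num_cores=2))
```

Supplying B already transposed is what lets each core walk both operands in row order, which is the kind of benefit the abstract attributes to transposing during the transfer rather than in a separate pass.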
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP332


Article ID: 1597334


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1597334.html
