
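Design and Implementation of High-Performance DMA Transfer Modes for Scientific Computing on GPDSP

Published: 2018-03-11 08:17

  Topic: scientific computing. Focus: GPDSP. Source: National University of Defense Technology, master's thesis, 2015. Document type: degree thesis.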


[Abstract]: High-performance computing is a major topic in computational science today, and it mainly involves algorithm research and processor design. The GPDSP processor is a high-performance multi-core microprocessor designed independently by our university, and it combines the strengths of general-purpose processors and digital signal processors. An analysis of the HPL high-performance benchmark shows that the main factor limiting HPL execution efficiency is the matrix update operation, which is implemented by calling the general matrix multiply-add routine (GEMM). GEMM can be implemented in many ways, and a large body of research indicates that the scheme based on the GEPP-GEBP approach achieves the highest execution efficiency on the GPDSP processor. Combining the architectural features of the GPDSP processor with an analysis of GEMM implementation schemes, this thesis designs a set of special DMA transfer modes: DMA matrix transpose transfer, DMA segmented transfer, DMA inter-core synchronous transfer, and DMA blocking segmented transfer, which builds on DMA segmented transfer.

DMA matrix transpose transfer moves a two-dimensional data block from the source storage space to the destination storage space and transposes the matrix during the move. Using it can greatly improve the efficiency of matrix multiplication. In tests with a simulation and verification tool, the transfer efficiency of the DMA matrix transpose transfer designed here is more than 1.56 times that of the conventional matrix transpose transfer.

The general idea behind the GEMM implementation is to split the off-core data into many small blocks, send them to the local memories of multiple DSP cores for computation, and then move all results back to off-core storage for synchronization. For this reason, the thesis designs DMA segmented transfer, DMA inter-core synchronous transfer, and DMA blocking segmented transfer. DMA segmented transfer quickly moves data from off-core storage into the on-core memories of multiple cores, DMA inter-core synchronous transfer quickly moves data from the on-core memories of multiple cores back to off-core storage, and DMA blocking segmented transfer effectively hides the data movement time. According to tests with Cadence's NC-VERILOG simulation and verification tool, DMA segmented transfer is more than 1.24 times as fast as the conventional transfer mode, DMA blocking segmented transfer reduces the time of the GEMM core computation by at least 3000 cycles, and the average speed of DMA inter-core synchronous transfer is 2.56 times that of the conventional transfer mode. After thorough verification and experimental testing, the special DMA transfer modes designed in this thesis meet the requirements of the algorithm and effectively improve the execution efficiency of the HPL benchmark.
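The data movement patterns named in the abstract can be illustrated with a small software model. The sketch below is not the thesis's hardware design: it treats memory as flat Python lists, and the function names (`dma_transpose_2d`, `dma_segmented_scatter`, `dma_gather`) and their parameters are hypothetical, chosen only for illustration. It shows what a transposing 2D block transfer does to element positions, and how a segmented transfer and its inverse split a buffer across cores and reassemble it.

```python
from typing import List


def dma_transpose_2d(src: List[float], src_base: int, src_stride: int,
                     rows: int, cols: int,
                     dst: List[float], dst_base: int, dst_stride: int) -> None:
    """Copy a rows x cols block out of `src` and store it transposed in `dst`.

    Both buffers are flat lists standing in for memory. Each stride is the
    distance, in elements, between consecutive rows of that buffer.
    (Hypothetical software model, not the GPDSP DMA interface.)
    """
    for r in range(rows):
        for c in range(cols):
            # Source element (r, c) lands at position (c, r) in the destination.
            dst[dst_base + c * dst_stride + r] = src[src_base + r * src_stride + c]


def dma_segmented_scatter(src: List[float], num_cores: int) -> List[List[float]]:
    """Split `src` into `num_cores` equal contiguous segments, one per core.

    Each returned list models one core's local memory (an assumption of
    this sketch; the real segment layout is not given in the abstract).
    """
    if num_cores <= 0 or len(src) % num_cores != 0:
        raise ValueError("source length must be a positive multiple of num_cores")
    seg = len(src) // num_cores
    return [src[i * seg:(i + 1) * seg] for i in range(num_cores)]


def dma_gather(core_buffers: List[List[float]]) -> List[float]:
    """Concatenate per-core buffers back into one off-core buffer, in core order."""
    out: List[float] = []
    for buf in core_buffers:
        out.extend(buf)
    return out


if __name__ == "__main__":
    matrix = list(range(12))          # a 3 x 4 matrix stored row by row
    transposed = [0] * 12             # room for the 4 x 3 result
    dma_transpose_2d(matrix, 0, 4, 3, 4, transposed, 0, 3)
    print(transposed)

    per_core = dma_segmented_scatter(matrix, 4)
    print(per_core)
    print(dma_gather(per_core) == matrix)
```

In this model the transpose costs nothing extra because it is only an address calculation performed during the copy, which is the intuition behind folding the transpose into the transfer instead of running it as a separate pass on a core.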
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP332





Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1597334.html

