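Design and Implementation of High-Performance Parallel BLAS3 for Multi-core DSPs
Published: 2018-01-22 01:17
Keywords: multi-core processor; parallel; linear algebra library; matrix multiplication; blocking algorithm. Source: National University of Defense Technology, 2013 master's thesis. Document type: degree thesis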
[Abstract]: The BLAS library has long played a very important role in high-performance computing, and the efficiency it achieves is a principal benchmark for high-performance computing systems. Studying a parallel BLAS library for multi-core DSPs is therefore of real practical significance, both for evaluating and applying multi-core DSPs in high-performance computing and for exploiting their parallel computing performance. This thesis studies the BLAS3 routines in depth. It designs and implements single-core GEMM, SYMM, SYRK, SYR2K, and TRMM on the C6678, and, building on the multi-core communication and synchronization mechanisms of the C6678, designs and implements parallel GEMM, SYMM, SYRK, SYR2K, and TRMM. The main research work covers the following aspects:
1. Design and implementation of single-core GEMM on the C6678. Given the multi-level memory hierarchy of the architecture, the core loop of GEMM is compared and analyzed at the cache level in terms of its memory-access ratio. Memory access is optimized according to the hardware resources and architecture of the C6678, the storage space is partitioned appropriately, and a high-performance GEMM is designed and implemented. In testing, it reaches 8.49 GFLOPS.
2. Design and implementation of single-core BLAS3 on the C6678. The computational characteristics of the four routines SYMM, SYRK, SYR2K, and TRMM are analyzed in detail. For SYMM, data access to the symmetric matrix is optimized. For SYRK, the BP kernel that updates the symmetric matrix is optimized. For SYR2K, the computation is transformed so that it can call the SYRK interface routine directly. For TRMM, access to the triangular matrix is analyzed and the BP kernel is optimized according to the characteristics of the data on the diagonal. Using the hardware mechanisms of the C6678, the four routines are mapped efficiently onto a single C6678 core, reaching 8.241, 8.102, 8.008, and 8.203 GFLOPS for SYMM, SYRK, SYR2K, and TRMM respectively.
3. Design and implementation of multi-core parallel BLAS3 on the C6678. The algorithmic rules of each routine are analyzed in depth. The data is decomposed for parallel execution by blocking, so that the computations on different blocks are independent of one another, and the load balance across cores is optimized. Using the multi-core communication and synchronization mechanisms of the C6678, the parallel blocked algorithms are mapped efficiently onto multiple cores. In performance tests, the eight-core parallel speedups of the BLAS3 routines GEMM, SYMM, SYRK, SYR2K, and TRMM are 6.21, 5.22, 4.49, 4.49, and 4.55 respectively.
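The blocked decomposition described in the abstract can be illustrated with a minimal sketch in portable C. It is not the thesis implementation. The names `panel_bounds`, `gemm_panel`, and `gemm_blocked`, the constants `NUM_WORKERS` and `BLOCK`, the row-major layout, and the single-precision type are all assumptions made for illustration. The rows of C are split into independent panels, one per worker, and each panel is computed with a cache-blocked loop nest. A sequential loop stands in for the multi-core dispatch, and no C6678-specific intrinsics, DMA transfers, or synchronization primitives are used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning parameters; neither value is taken from the thesis. */
#define NUM_WORKERS ((size_t)8)  /* one worker per core of an eight-core device */
#define BLOCK ((size_t)32)       /* edge length of a cache block, in elements */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Compute the row range [*begin, *end) owned by `worker` when m rows are
 * split as evenly as possible: the first (m % workers) workers get one
 * extra row, which keeps the load imbalance to at most one row. */
void panel_bounds(size_t m, size_t workers, size_t worker,
                  size_t *begin, size_t *end)
{
    assert(workers > 0 && worker < workers);
    size_t base = m / workers;
    size_t extra = m % workers;
    *begin = worker * base + min_size(worker, extra);
    *end = *begin + base + (worker < extra ? 1 : 0);
}

/* Update rows r0..r1-1 of C: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all row-major. */
void gemm_panel(size_t r0, size_t r1, size_t n, size_t k,
                float alpha, const float *a, const float *b,
                float beta, float *c)
{
    /* Scale the panel of C once, before accumulating any products. */
    for (size_t i = r0; i < r1; ++i)
        for (size_t j = 0; j < n; ++j)
            c[i * n + j] = (beta == 0.0f) ? 0.0f : beta * c[i * n + j];

    /* Blocked loop nest: each (ii, pp, jj) triple touches a small tile of
     * A, B, and C that is reused while it is still in cache. */
    for (size_t ii = r0; ii < r1; ii += BLOCK)
        for (size_t pp = 0; pp < k; pp += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t i_end = min_size(ii + BLOCK, r1);
                size_t p_end = min_size(pp + BLOCK, k);
                size_t j_end = min_size(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t p = pp; p < p_end; ++p) {
                        float aip = alpha * a[i * k + p];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aip * b[p * n + j];
                    }
            }
}

/* Sequential stand-in for the parallel dispatch: every iteration is the
 * work one core would do, and the panels do not overlap, so the
 * iterations are independent of one another. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  float alpha, const float *a, const float *b,
                  float beta, float *c)
{
    for (size_t w = 0; w < NUM_WORKERS; ++w) {
        size_t r0, r1;
        panel_bounds(m, NUM_WORKERS, w, &r0, &r1);
        gemm_panel(r0, r1, n, k, alpha, a, b, beta, c);
    }
}
```

Because each worker writes only its own rows of C, the panels could in principle be handed to separate cores without locking the output. For scale, the reported eight-core GEMM speedup of 6.21 corresponds to a parallel efficiency of about 6.21 / 8 ≈ 0.78.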
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.6
Article ID: 1453163
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1453163.html