Design and Implementation of FFT Algorithms for Multi-core Vector Processors
Published: 2018-01-01 06:09
Source: National University of Defense Technology, 2014 master's thesis. Document type: degree thesis
Keywords: FFT; fused multiply-add; multi-core processor; vectorization; software pipelining
[Abstract]: As a primary tool of digital signal processing, the FFT plays an important role in high-performance computing and is an important benchmark of processor performance. Given the architectural characteristics of the multi-core vector processor X-DSP, research into efficient vectorized FFT design and implementation has both theoretical significance and practical value. This thesis analyzes the characteristics of FFT algorithms in depth and designs and implements radix-2 FFT, radix-4 FFT, large-point FFT, and mixed-radix FFT programs. The main research work is as follows:
(1) A radix-2 FFT program for X-DSP is designed and implemented. Targeting the fused multiply-add (FMA) feature of the architecture, the butterfly units of the DIT and DIF radix-2 FFT are analyzed and optimized to make full use of FMA instructions, raising FLOPS throughput. Shuffle requests are combined with memory-access requests, and software pipelining is applied, improving execution efficiency. Experimental results show that, relative to the CUFFT library, the average performance of single- and double-precision radix-2 FFT improves by factors of 3.12 and 22.97, respectively; relative to the FFTW library, it improves by factors of 3.52 and 25.29, respectively.
(2) A radix-4 FFT program for X-DSP is designed and implemented. The butterfly units of the DIT and DIF radix-4 FFT are optimized to make full use of FMA instructions, and shuffle requests are again combined with memory-access requests. Experimental results show that DIT radix-4 outperforms DIT radix-2 by 11.46% to 21.34%, and that DIF radix-4 outperforms DIF radix-2 by 9.1% on average.
(3) A large-point FFT program for X-DSP is designed and implemented. The large-point FFT algorithms MFA and iterative FFT are analyzed in detail, and a single-core program based on DMA double buffering is designed and optimized. A method for storing the coefficient (twiddle) factors in compressed form is proposed to save memory. The parallel blocked MFA algorithm is mapped onto multiple cores and the load balance among cores is optimized, yielding an efficient multi-core parallel large-point FFT with an average speedup of 6.43.
(4) The butterfly units of the DIT radix-3 and radix-5 FFT are analyzed and optimized, and a mixed-radix FFT program for X-DSP is designed and implemented. Experimental results show that the single-precision mixed-radix FFT takes 0.00247 ms for 1536 points and 0.00348 ms for 2400 points, and that mixed-radix FFT performance improves markedly as the number of points increases.
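The abstract does not give the thesis's butterfly formulas, so the following is only a minimal sketch of one well-known way to shape a radix-2 DIT butterfly for FMA hardware, not the author's implementation. It computes X = a + w*b with two chained multiply-adds per component and then Y = 2a - X with one multiply-add per component, for six multiply-adds in total. The names `fma`, `butterfly_dit2`, and `fft_radix2` are hypothetical, and `fma` is a plain-Python stand-in that rounds twice, whereas a hardware FMA rounds once.

```python
import cmath


def fma(a, b, c):
    # Stand-in for a hardware fused multiply-add (assumption: a * b + c).
    # Plain Python rounds twice here; a real FMA unit rounds once.
    return a * b + c


def butterfly_dit2(ar, ai, br, bi, wr, wi):
    """Radix-2 DIT butterfly in six multiply-adds.

    Inputs are the real/imaginary parts of a, b and the twiddle w.
    Returns (Xr, Xi, Yr, Yi) with X = a + w*b and Y = a - w*b.
    """
    # X = a + w*b: two chained multiply-adds per component.
    xr = fma(-wi, bi, fma(wr, br, ar))
    xi = fma(wi, br, fma(wr, bi, ai))
    # Y = a - w*b = 2a - X: one multiply-add per component.
    yr = fma(2.0, ar, -xr)
    yi = fma(2.0, ai, -xi)
    return xr, xi, yr, yi


def fft_radix2(x):
    """Iterative radix-2 DIT FFT built on butterfly_dit2 (scalar sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    re = [0.0] * n
    im = [0.0] * n
    # Bit-reversal permutation puts the inputs in the order DIT expects.
    for i, v in enumerate(x):
        j = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        c = complex(v)
        re[j], im[j] = c.real, c.imag
    half = 1
    while half < n:
        for k in range(half):
            # Twiddle for a stage of span 2*half: exp(-2*pi*i*k / (2*half)).
            w = cmath.exp(-1j * cmath.pi * k / half)
            for lo in range(k, n, 2 * half):
                hi = lo + half
                re[lo], im[lo], re[hi], im[hi] = butterfly_dit2(
                    re[lo], im[lo], re[hi], im[hi], w.real, w.imag
                )
        half *= 2
    return [complex(r, i) for r, i in zip(re, im)]
```

A conventional butterfly spends four multiplications and six additions on the same work, so on hardware that issues a multiply-add per cycle the restructured form needs fewer instructions. Vectorization, shuffle handling, and software pipelining, which the thesis also covers, are outside the scope of this scalar sketch.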
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP332
Article ID: 1363248
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1363248.html