Design of a Multi-Core Parallel Computing System for Backprojection Imaging

Published: 2018-03-26 11:46

Topic: Backprojection algorithm. Focus: radar imaging. Source: Nanjing University, master's thesis, 2013

[Abstract]: The Backprojection radar imaging algorithm is extremely computation-intensive and places very high demands on the performance of the imaging system. Based on an analysis of the algorithm's characteristics, this thesis makes full use of several parallel computing techniques to design a high-performance Backprojection radar imaging system, and it proposes and implements a number of key performance-improving techniques. Because the pulse preprocessing stage of the algorithm consists largely of long complex-vector operations and large-point FFTs, a SIMD vector processor that directly supports FFT acceleration instructions was designed. For performance reasons, earlier system designs performed the FFT in a hardware accelerator. The SIMD vector processor not only carries out all the long vector operations of the pulse preprocessing stage efficiently but also supports FFT acceleration instructions directly. These instructions provide the same FFT acceleration efficiency as a dedicated hardware accelerator, which avoids the extra hardware cost of adding such an accelerator to the system. Because the backprojection stage is extremely performance-critical, a backprojection accelerator was designed. Its function is to backproject the preprocessed pulse data onto every pixel of the image, and it completes the backprojection of one pixel per clock cycle. On the basis of a thorough error analysis, a carefully designed fixed-point representation replaces the double-precision floating-point representation. This reduces logic resource usage by about 50% and on-chip memory usage by 37.5%, and it also improves computational accuracy: the maximum phase error falls from 11° to 1.4°.
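The fixed-point idea described above can be illustrated with a small sketch. It is not the thesis design: the abstract gives no word lengths, so the 8-bit phase width, the function names, and the sample ranges and wavelength are all assumptions chosen for illustration. The sketch stores the two-way phase as an unsigned fraction of a full turn, so the modulo-360° wrap becomes a bit mask and the truncation error stays below one step of 360°/256 ≈ 1.41°.

```python
import math

PHASE_BITS = 8                     # assumed word length, not taken from the thesis
PHASE_STEPS = 1 << PHASE_BITS      # 256 codes per full turn


def phase_float_deg(range_m, wavelength_m):
    """Reference two-way phase in degrees, computed in floating point."""
    turns = 2.0 * range_m / wavelength_m      # two-way path length in wavelengths
    return (turns % 1.0) * 360.0


def phase_fixed(range_m, wavelength_m):
    """Two-way phase as an unsigned PHASE_BITS-bit code (fraction of a turn)."""
    turns = 2.0 * range_m / wavelength_m
    # Truncate onto the fixed-point grid; the mask performs the wrap to one turn.
    return int(math.floor(turns * PHASE_STEPS)) & (PHASE_STEPS - 1)


def fixed_to_deg(code):
    """Convert a fixed-point phase code back to degrees."""
    return code * 360.0 / PHASE_STEPS


def max_quantization_error_deg(samples=10000):
    """Largest truncation error over a sweep of hypothetical target ranges."""
    wavelength = 0.03                          # hypothetical 3 cm wavelength
    worst = 0.0
    for i in range(samples):
        r = 1000.0 + i * 0.00137               # hypothetical ranges in metres
        err = phase_float_deg(r, wavelength) - fixed_to_deg(phase_fixed(r, wavelength))
        worst = max(worst, err)
    return worst
```

For example, `phase_fixed(1/256, 1/32)` returns 64, which `fixed_to_deg` maps to 90.0, a quarter turn. Adding whole turns to the range leaves the code unchanged, which is the property that lets hardware drop the integer part of the phase without any division.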
The backprojection stage is so computation-intensive that a single backprojection accelerator falls far short of the system's performance requirements. This thesis therefore integrates several backprojection accelerators into a backprojection subsystem to raise performance further through parallel computing. Doing so raises two problems: parallelizing the backprojection algorithm and mapping the parallel algorithm onto multiple compute units. Starting from the original pixel-parallel scheme, a pulse-parallel scheme was designed and the architecture of the backprojection subsystem was redesigned. For a subsystem integrating 8 backprojection accelerator cores, both the main-memory bandwidth requirement and the number of on-chip pixel memory banks are reduced by 87.5%. Compared with a single backprojection accelerator, the subsystem achieves a speedup greater than 7.99 using exactly the same on-chip pixel memory and exactly the same main-memory bandwidth, with 8 times the backprojection accelerator cores and on-chip pulse memory. In addition, to address the excessive simulation time of the algorithm during development, this thesis also explores accelerating the simulation of the Backprojection radar imaging algorithm with GPU parallel computing. Based on an analysis of the GPU platform and the algorithm, the pixel-parallel scheme was chosen for acceleration. A simulation that originally took 5 hours 23 minutes takes only 3 minutes 20 seconds after GPU acceleration, a speedup of 97 times.
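The difference between the pixel-parallel and pulse-parallel schemes can be sketched in a few lines. This is one plausible reading of the abstract, not the thesis architecture: the `contribution` function is a hypothetical stand-in for the real interpolation and phase correction, and the access counter models only pixel read-modify-write traffic. Under those assumptions, letting 8 cores each handle a different pulse for the same pixel and summing their results before a single write cuts pixel traffic to 1/8, consistent with the 87.5% figure quoted above.

```python
import cmath


def contribution(pulse_idx, pixel_idx):
    """Hypothetical stand-in for one pulse's backprojected value at one pixel."""
    phase = 0.1 * pulse_idx * (pixel_idx + 1)
    return cmath.exp(1j * phase)


def pixel_parallel(n_pulses, n_pixels, n_cores):
    """Each core owns an interleaved slice of pixels and processes every pulse."""
    image = [0j] * n_pixels
    accesses = 0
    for p in range(n_pulses):
        for core in range(n_cores):
            for x in range(core, n_pixels, n_cores):
                image[x] += contribution(p, x)   # one pixel update per pulse
                accesses += 1
    return image, accesses


def pulse_parallel(n_pulses, n_pixels, n_cores):
    """Each core handles a different pulse for the same pixel; results are summed."""
    image = [0j] * n_pixels
    accesses = 0
    for first in range(0, n_pulses, n_cores):
        group = range(first, min(first + n_cores, n_pulses))
        for x in range(n_pixels):
            partial = sum(contribution(p, x) for p in group)  # one term per core
            image[x] += partial                  # one pixel update per pulse group
            accesses += 1
    return image, accesses
```

With 64 pulses, 32 pixels and 8 cores, both functions produce the same image up to rounding, while the pixel-update count drops from 2048 to 256, a reduction of 1 - 1/8 = 87.5%. The cost is that each core needs its own pulse buffer, which matches the abstract's mention of 8 times the on-chip pulse memory.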
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP338.6



Article No.: 1667766


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1667766.html
