Research on the Implementation and Optimization of Typical Image Processing Algorithms on the Xeon Phi Platform

Published: 2018-12-10 11:05

[Abstract]: With the rise of heterogeneous platforms, high-performance computing has developed rapidly. Heterogeneous CPU+GPU platforms are widely used in many fields, such as bioinformatics, medical imaging, and computational fluid dynamics. However, CPUs and GPUs use different instruction sets and programming models, which places high demands on program development and optimization. In 2012 Intel introduced the Xeon Phi coprocessor, which is based on a many-core architecture and is compatible with the traditional x86 programming model and features, reducing the difficulty of programming and optimization to some extent. Xeon Phi integrates more than 50 lightweight x86 cores, each supporting 4 hardware threads and 512-bit SIMD vector processing, so it has strong parallel processing capability. At present, research on using Xeon Phi for algorithm optimization and acceleration is still in its infancy. This thesis studies the implementation and acceleration of typical image processing algorithms on the Xeon Phi platform. Image processing algorithms demand high computational performance and are characterized by large data volumes and strict real-time requirements. Two representative algorithms are selected as case studies: the 2D IDCT algorithm and the 3D GVF field algorithm. The main work of this thesis is as follows.

(1) Implementation and optimization of 2D IDCT on the Xeon Phi platform. The 2D IDCT is first implemented serially using row-column separation, and this version serves as the performance baseline for later optimizations. The serial 2D IDCT is then vectorized with 512-bit SIMD and scaled across threads with OpenMP, and finally data prefetching optimization is applied. Experimental results show that, for single-precision image data, vectorization yields a speedup of about 5.84x over the non-vectorized version, and performance increases approximately linearly with the number of threads. Data prefetching yields a further speedup of about 1.24x on top of the existing optimizations. Overall, the best performance of the optimized 2D IDCT on Xeon Phi is about 1.53x the best performance on a single E5-2670 CPU.

(2) Implementation and optimization of 3D GVF field computation on the Xeon Phi platform. In addition to general optimizations such as vectorization and thread scaling, the thesis focuses on how stencil-computation optimization affects performance and proposes a loop blocking strategy that effectively improves cache utilization. Experimental results show that, for double-precision image data, thread scaling and vectorization significantly improve 3D GVF field performance. With the proposed blocking strategy, at problem sizes of 256×256×256 and 512×512×512, 3D GVF on Xeon Phi achieves speedups of about 1.78x and 2.77x, respectively, over a single E5-2670 CPU.

(3) A summary of optimization patterns for image processing algorithms on the Xeon Phi platform, collected into practical guidelines to ease the optimization of other image processing algorithms. In general, for compute-intensive algorithms, applying general optimizations such as vectorization and thread scaling directly already gives good performance gains. For image processing algorithms with a low compute-to-memory-access ratio, cache utilization must be improved, and the loop blocking strategy proposed in this thesis is one effective way to do so.
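The row-column separation mentioned in item (1) can be illustrated with a minimal sketch. It is not the thesis code, which targets Xeon Phi with 512-bit SIMD and OpenMP; Python is used here only for readability. The 8×8 block size, the orthonormal scaling, and all function names are assumptions made for illustration, since the abstract does not specify them. The sketch computes a 2D IDCT as a pass of 1D IDCTs over the rows followed by a pass over the columns, and includes a direct evaluation of the 2D definition as a reference.

```python
import math

N = 8  # assumed block size; the abstract does not state one


def scale(k, n):
    """Orthonormal DCT scaling factor (an assumed convention)."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def idct_1d(coeffs):
    """1D inverse DCT of a single row or column."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for u in range(n):
            angle = (2 * x + 1) * u * math.pi / (2 * n)
            total += scale(u, n) * coeffs[u] * math.cos(angle)
        out.append(total)
    return out


def idct_2d_separable(block):
    """2D IDCT via row-column separation: 1D IDCT on rows, then on columns."""
    row_pass = [idct_1d(row) for row in block]
    # zip(*row_pass) iterates over columns of the row-pass result
    col_pass = [idct_1d(list(col)) for col in zip(*row_pass)]
    # col_pass is stored column-major, so transpose back to row-major
    return [list(row) for row in zip(*col_pass)]


def idct_2d_direct(block):
    """Reference: the 2D definition evaluated directly with four nested loops."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            total = 0.0
            for u in range(n):
                for v in range(n):
                    total += (scale(u, n) * scale(v, n) * block[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[x][y] = total
    return out
```

Each 1D transform in a pass is independent of the others in that pass, which is the property that makes the separable form a natural fit for SIMD lanes and for distributing rows or blocks across threads.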
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP38;TP391.41

Article ID: 2370464


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2370464.html
