Research on the Implementation and Optimization of Typical Image Processing Algorithms on the Xeon Phi Platform
Published: 2018-12-10 11:05
[Abstract]: With the rise of heterogeneous platforms, high-performance computing has developed rapidly. Heterogeneous CPU+GPU platforms are widely used in fields such as bioinformatics, medical imaging, and computational fluid dynamics. However, CPUs and GPUs use different instruction sets and programming models, which places high demands on program development and optimization. In 2012 Intel introduced the Xeon Phi coprocessor, based on a many-core architecture. It is compatible with the traditional x86 programming model and features, which to some extent lowers the difficulty of programming and optimization. Xeon Phi integrates more than 50 lightweight x86 cores, each supporting 4 hardware threads and 512-bit SIMD vector processing, giving it strong parallel processing capability. Research on accelerating algorithms with Xeon Phi is still at an early stage. This thesis studies the implementation and acceleration of typical image processing algorithms on the Xeon Phi platform. Image processing algorithms demand high computational performance, handle large volumes of data, and often have strict real-time requirements. Two representative algorithms are selected as case studies: the 2D inverse discrete cosine transform (2D IDCT) and the 3D gradient vector flow (3D GVF) field computation. The main work of this thesis is as follows.

(1) Implementation and optimization of 2D IDCT on the Xeon Phi platform. A serial 2D IDCT is first implemented using row-column separation and serves as the performance baseline for the subsequent optimizations. The serial version is then vectorized with 512-bit SIMD and scaled across threads with OpenMP, and finally optimized with data prefetching. Experimental results show that, for single-precision image data, vectorization yields a speedup of about 5.84x over the non-vectorized version, and performance increases approximately linearly with the number of threads. Data prefetching gives a further speedup of about 1.24x on top of the existing optimizations. Overall, the best performance of the optimized 2D IDCT on Xeon Phi is about 1.53x the best performance on a single E5-2670 CPU.

(2) Implementation and optimization of the 3D GVF field computation on the Xeon Phi platform. Besides general optimizations such as vectorization and thread scaling, the work focuses on how stencil-computation optimization affects performance and proposes a loop-blocking (tiling) strategy that effectively improves cache utilization. Experimental results show that, for double-precision image data, thread scaling and vectorization significantly improve 3D GVF performance. With the proposed blocking strategy, for problem sizes of 256×256×256 and 512×512×512, 3D GVF on Xeon Phi achieves speedups of about 1.78x and 2.77x, respectively, over a single E5-2670 CPU.

(3) A summary of optimization patterns for image processing algorithms on the Xeon Phi platform, with optimization techniques collected as guidance for optimizing other image processing algorithms. In general, compute-intensive algorithms obtain good performance gains directly from general techniques such as vectorization and thread scaling. Image processing algorithms with a low compute-to-memory-access ratio additionally need better cache utilization, and the loop-blocking strategy proposed in this thesis is one effective way to achieve it.
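The loop-blocking idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual GVF kernel or block sizes, so everything here is an assumption made for illustration: a generic 7-point averaging stencil stands in for the GVF update, and the names `stencil`, `sweep_naive`, `sweep_tiled`, `tile_y` and `tile_z` are hypothetical. The point is only the traversal order: the y/z plane is cut into tiles and each tile is swept along x, so neighbouring values are reused while they are still likely to be in cache. In pure Python this shows correctness of the reordering rather than any speedup; a cache benefit would only be expected in compiled, vectorized code.

```python
def idx(x, y, z, ny, nz):
    """Flatten (x, y, z) into an index of a row-major 1D array."""
    return (x * ny + y) * nz + z


def stencil(src, x, y, z, ny, nz):
    """Average of the six face neighbours (an assumed stand-in kernel)."""
    return (
        src[idx(x - 1, y, z, ny, nz)] + src[idx(x + 1, y, z, ny, nz)]
        + src[idx(x, y - 1, z, ny, nz)] + src[idx(x, y + 1, z, ny, nz)]
        + src[idx(x, y, z - 1, ny, nz)] + src[idx(x, y, z + 1, ny, nz)]
    ) / 6.0


def sweep_naive(src, nx, ny, nz):
    """One sweep over all interior points in plain x, y, z loop order."""
    dst = list(src)  # boundary values are copied through unchanged
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                dst[idx(x, y, z, ny, nz)] = stencil(src, x, y, z, ny, nz)
    return dst


def sweep_tiled(src, nx, ny, nz, tile_y=4, tile_z=8):
    """Same sweep, but the y/z plane is visited one tile at a time."""
    dst = list(src)
    for y0 in range(1, ny - 1, tile_y):
        y1 = min(y0 + tile_y, ny - 1)      # clip the last tile in y
        for z0 in range(1, nz - 1, tile_z):
            z1 = min(z0 + tile_z, nz - 1)  # clip the last tile in z
            for x in range(1, nx - 1):     # stream each tile along x
                for y in range(y0, y1):
                    for z in range(z0, z1):
                        dst[idx(x, y, z, ny, nz)] = stencil(src, x, y, z, ny, nz)
    return dst


if __name__ == "__main__":
    nx, ny, nz = 6, 7, 9
    field = [float((i * 37) % 11) for i in range(nx * ny * nz)]
    same = sweep_tiled(field, nx, ny, nz) == sweep_naive(field, nx, ny, nz)
    print("tiled sweep matches naive sweep:", same)
```

Because each output point is computed from the read-only `src` array, the tiles are independent of one another, which is also what would make the outer tile loops natural candidates for thread-level parallelism in a compiled implementation.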
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP38; TP391.41
Article ID: 2370464
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2370464.html