基于多核處理器的并行3D-FFT研究
發(fā)布時間:2019-03-11 14:12
【摘要】:隨著計(jì)算任務(wù)量的增加,單核處理器已不能滿足用戶需求,主要原因是功耗問題限制了單核處理器不斷提高性能的途徑。而多核處理器的問世,為多任務(wù)、大數(shù)據(jù)難題提供了解決方案。3D-FFT是科學(xué)與工程研究中最重要的算法之一,在計(jì)算機(jī)視覺和模式識別,視頻電話和核磁共振成像算法中,都有著至關(guān)重要的作用。如何在多核處理器上更快速的進(jìn)行3D-FFT計(jì)算,是科學(xué)研究中首要面臨的難題。本文針對并行3D-FFT在多核處理上的應(yīng)用,從數(shù)據(jù)預(yù)處理,數(shù)據(jù)計(jì)算和數(shù)據(jù)轉(zhuǎn)置三個方面對3D-FFT算法進(jìn)行研究。在數(shù)據(jù)預(yù)處理階段,為避免出現(xiàn)有的核滿載,有的核空轉(zhuǎn)現(xiàn)象,采用負(fù)載均衡算法讓每個核心平均處理任務(wù)。同時,為了避免計(jì)算階段數(shù)據(jù)在核心之間不必要的遷移,提出線程與CPU親和性算法,讓數(shù)據(jù)在指定的CPU核心運(yùn)行。為解決共享內(nèi)存多核處理器偽共享難題,提出緩存行填充算法,使得屬于不同核心的數(shù)據(jù)被分在獨(dú)立的緩存行內(nèi)。在2核心與8核心下測試算法緩存命中率,實(shí)驗(yàn)表明,緩存行填充以及線程與CPU親和性算法,2核心下,一級數(shù)據(jù)緩存未命中率平均降低了0.2%,三級緩存未命中率平均降低了0.2%,8核心下分別降低了0.2%與0.1%。在數(shù)據(jù)計(jì)算階段,以多列FFT算法為基礎(chǔ),使用多線程并行化處理。對于FFT每次計(jì)算,使用經(jīng)典六步快速傅里葉變換算法把數(shù)據(jù)劃分成更小的數(shù)據(jù)量進(jìn)行計(jì)算,加快處理速度。對位反轉(zhuǎn)處理過程,采用兩端反轉(zhuǎn)策略進(jìn)行優(yōu)化,一次處理四個數(shù)據(jù)點(diǎn),使算法時間復(fù)雜度降為原來的一半。在數(shù)據(jù)轉(zhuǎn)置階段,提出全局轉(zhuǎn)置優(yōu)化算法,減少參與核間通信數(shù)據(jù)點(diǎn)的總數(shù),從而加快全局轉(zhuǎn)置速度。與現(xiàn)有全局轉(zhuǎn)置算法相比,通信時間平均減少0.05秒,約提高9.5%。并使用多核模擬器Sniper Simulator對其功耗進(jìn)行統(tǒng)計(jì),在8個計(jì)算核心下,全局轉(zhuǎn)置優(yōu)化算法與現(xiàn)有轉(zhuǎn)置算法,核間功耗,核與內(nèi)存功耗以及一級數(shù)據(jù)緩存功率消耗均值分別約為20W和22W,7.2W和7.5W以及16.9W和17.1W。為進(jìn)一步體現(xiàn)算法性能優(yōu)越性,與已有的FFTW和Open MP多線程進(jìn)行比較。在8核心下,與FFTW和Open MP緩存未命中率相比,3D-FFT全局轉(zhuǎn)置優(yōu)化算法緩存未命中率幾乎為0。與FFTW計(jì)算三維離散傅里葉變換的庫函數(shù)相比,在2核心下,3D-FFT全局轉(zhuǎn)置優(yōu)化算法算法性能是其1.59倍,8核心下是4.93倍。若對FFTW庫函數(shù)使用OpenMP開辟多個線程進(jìn)行加速,3D-FFT全局轉(zhuǎn)置優(yōu)化算法性能表現(xiàn)在2核心與8核心下分別是多線程Open MP的1.07倍與1.48倍。
[Abstract]:With the increase of computing tasks, single-core processors are no longer able to meet the needs of users. The main reason is that power consumption limits the way to improve the performance of single-core processors. 3D-FFT is one of the most important algorithms in scientific and engineering research, in computer vision and pattern recognition, video phone and magnetic resonance imaging algorithms, and the introduction of multi-core processors provides a solution to the problem of multi-task and big data. All have a vital role to play. How to compute 3D-FFT more quickly on multi-core processors is the most important problem in scientific research. Aiming at the application of parallel 3D-FFT in multi-core processing, this paper studies the 3D-FFT algorithm from three aspects: data preprocessing, data calculation and data transfer. In the data pre-processing phase, in order to avoid the existing core full load, some core idle phenomena, load balancing algorithm is used to make each core process tasks averagely. At the same time, in order to avoid unnecessary migration of data between the core in computing phase, an affinity algorithm between threads and CPU is proposed to make the data run in the specified CPU core. In order to solve the pseudo-sharing problem of shared memory multi-core processors, a cache row filling algorithm is proposed, so that the data belonging to different cores are divided into separate cache rows. The cache hit ratio of the algorithm is tested under 2 core and 8 core. The experiment shows that cache row filling and thread affinity with CPU algorithm, under the core 2, the first-level data cache miss ratio is reduced by 0.2%, on average, the cache hit ratio of the first-level data cache is reduced by 0.2% on average. The 3-tier cache miss rate was reduced by 0.2% on average and 0.2% and 0.1% at the core 8, respectively. In the data computing phase, based on the multi-column FFT algorithm, multi-thread parallelization is used. For each calculation of FFT, the classical six-step fast Fourier transform algorithm is used to divide the data into smaller data to calculate, so as to accelerate the processing speed. In order to reduce the time complexity of the algorithm to half of the original algorithm, the two-end inversion strategy is used to optimize the process of bit inversion, and four data points are processed at one time. In the data transposition phase, a global transposition optimization algorithm is proposed to reduce the total number of data points involved in inter-kernel communication, so as to accelerate the global transposing speed. Compared with the existing global transposition algorithm, the average communication time is reduced by 0.05 seconds, and the communication time is increased by 9.5%. And using the multi-core simulator Sniper Simulator to statistics its power consumption. Under eight computing cores, the global transpose optimization algorithm and the existing transpose algorithm, inter-core power consumption, kernel and memory power consumption, and the average power consumption of first-level data cache are about 20W and 22W, respectively. 7.2W and 7.5W and 16.9W and 17.1W In order to further reflect the performance advantages of the algorithm, compared with the existing FFTW and Open MP multithreading. At 8 core, compared with FFTW and Open MP cache miss ratio, the 3D-FFT global transpose optimization algorithm cache miss rate is almost 0. 5%. Compared with the library function of 3-D discrete Fourier transform calculated by FFTW, the performance of 3D-FFT global transposed optimization algorithm is 1.59 times under 2 core and 4.93 times at 8 core. If the FFTW library functions are accelerated by using OpenMP to open up multiple threads, the performance of the global transpose optimization algorithm of 3D-FFT is 1.07 times and 1.48 times higher than that of multi-thread OpenMP under 2-core and 8-core respectively.
【學(xué)位授予單位】:哈爾濱工業(yè)大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2017
【分類號】:TP332
本文編號:2438347
[Abstract]:With the increase of computing tasks, single-core processors are no longer able to meet the needs of users. The main reason is that power consumption limits the way to improve the performance of single-core processors. 3D-FFT is one of the most important algorithms in scientific and engineering research, in computer vision and pattern recognition, video phone and magnetic resonance imaging algorithms, and the introduction of multi-core processors provides a solution to the problem of multi-task and big data. All have a vital role to play. How to compute 3D-FFT more quickly on multi-core processors is the most important problem in scientific research. Aiming at the application of parallel 3D-FFT in multi-core processing, this paper studies the 3D-FFT algorithm from three aspects: data preprocessing, data calculation and data transfer. In the data pre-processing phase, in order to avoid the existing core full load, some core idle phenomena, load balancing algorithm is used to make each core process tasks averagely. At the same time, in order to avoid unnecessary migration of data between the core in computing phase, an affinity algorithm between threads and CPU is proposed to make the data run in the specified CPU core. In order to solve the pseudo-sharing problem of shared memory multi-core processors, a cache row filling algorithm is proposed, so that the data belonging to different cores are divided into separate cache rows. The cache hit ratio of the algorithm is tested under 2 core and 8 core. The experiment shows that cache row filling and thread affinity with CPU algorithm, under the core 2, the first-level data cache miss ratio is reduced by 0.2%, on average, the cache hit ratio of the first-level data cache is reduced by 0.2% on average. The 3-tier cache miss rate was reduced by 0.2% on average and 0.2% and 0.1% at the core 8, respectively. In the data computing phase, based on the multi-column FFT algorithm, multi-thread parallelization is used. For each calculation of FFT, the classical six-step fast Fourier transform algorithm is used to divide the data into smaller data to calculate, so as to accelerate the processing speed. In order to reduce the time complexity of the algorithm to half of the original algorithm, the two-end inversion strategy is used to optimize the process of bit inversion, and four data points are processed at one time. In the data transposition phase, a global transposition optimization algorithm is proposed to reduce the total number of data points involved in inter-kernel communication, so as to accelerate the global transposing speed. Compared with the existing global transposition algorithm, the average communication time is reduced by 0.05 seconds, and the communication time is increased by 9.5%. And using the multi-core simulator Sniper Simulator to statistics its power consumption. Under eight computing cores, the global transpose optimization algorithm and the existing transpose algorithm, inter-core power consumption, kernel and memory power consumption, and the average power consumption of first-level data cache are about 20W and 22W, respectively. 7.2W and 7.5W and 16.9W and 17.1W In order to further reflect the performance advantages of the algorithm, compared with the existing FFTW and Open MP multithreading. At 8 core, compared with FFTW and Open MP cache miss ratio, the 3D-FFT global transpose optimization algorithm cache miss rate is almost 0. 5%. Compared with the library function of 3-D discrete Fourier transform calculated by FFTW, the performance of 3D-FFT global transposed optimization algorithm is 1.59 times under 2 core and 4.93 times at 8 core. If the FFTW library functions are accelerated by using OpenMP to open up multiple threads, the performance of the global transpose optimization algorithm of 3D-FFT is 1.07 times and 1.48 times higher than that of multi-thread OpenMP under 2-core and 8-core respectively.
【學(xué)位授予單位】:哈爾濱工業(yè)大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2017
【分類號】:TP332
【參考文獻(xiàn)】
相關(guān)期刊論文 前2條
1 劉益群;李焱;張?jiān)迫?張先軼;;Memory Efficient Two-Pass 3D FFT Algorithm for Intel~ Xeon Phi~(TM) Coprocessor[J];Journal of Computer Science and Technology;2014年06期
2 高麗,劉衛(wèi)新,張學(xué)智;FFT標(biāo)準(zhǔn)整序算法的優(yōu)化[J];探測與控制學(xué)報(bào);2004年02期
,本文編號:2438347
本文鏈接:http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2438347.html
最近更新
教材專著