Research on Programming Models and Key Performance Optimization Techniques for Petascale CPU-GPU Heterogeneous Systems

Published: 2019-05-19 14:27

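[Abstract]: The ever-growing computational demands of scientific computing have driven high-performance computer systems into the petascale era, and the key technologies developed for petascale systems will be the cornerstone of future exascale systems. Constrained by CMOS feature size, power consumption, and heat dissipation, homogeneous systems that rely entirely on CPUs for computing power are difficult to scale further once they reach petascale. Heterogeneous systems that use GPUs as accelerators have a better performance-per-watt ratio than homogeneous systems and are one of the most promising technical routes toward exascale systems. Tianhe-1A, built in November 2010 by the School of Computer Science of the National University of Defense Technology for the Tianjin supercomputing center, uses NVIDIA Fermi GPUs and ranked first in the world with a sustained performance of 2.566 PFLOPS. Such CPU-GPU heterogeneous systems provide powerful computing capability, but programming and performance optimization on them differ from traditional homogeneous computers, and both are key to exploiting the performance of the whole system. To address the difficulty of programming and optimizing applications on large-scale heterogeneous systems, this thesis takes petascale CPU-GPU heterogeneous systems as its platform and studies programming models and optimization methods for heterogeneous systems. The main contributions are:

1. The MPI/OpenMP/Streaming hybrid programming model is introduced on a petascale CPU-GPU heterogeneous computer system for the first time and extended to full-system scale. For the problem of mapping software tasks to hardware resources in the hybrid programming model, three mappings are proposed: node-centric, CPU-centric, and GPU-centric. Seven requirements for intra-node programming models on large-scale parallel systems are summarized and used to evaluate current programming models: ease of use, performance scalability, storage scalability, model hierarchy, scheduling flexibility, model heterogeneity, and positioning accuracy. In addition, a shared-memory-based method that lets multiple processes share a GPU is proposed, and an efficient [...]

2. Measurement-based adaptive task partitioning. All tasks are placed in a task queue and fetched from it in a loop. Each fetched task is split into a CPU part and an accelerator part according to the current "task partition ratio", whose initial value is derived from the theoretical peak performance of the CPU and the accelerator. The split task is then executed on the heterogeneous platform, its actual performance is measured once execution finishes, and the measured results are combined with the workload of that split to update the "task partition ratio", which serves as the basis for the next split. Because the ratio is adjusted adaptively after every split-and-execute cycle, the work assigned to the host and the accelerator is well load-balanced, which greatly improves the computational efficiency of the heterogeneous system.

3. Nested double-buffered software pipelining based on a finite state automaton. The execution of a GPU program consists of three parts: data input, GPU computation, and data output. The thesis analyzes the execution model and cost model of software pipelining on heterogeneous systems and designs a nested double-buffered software pipelining mechanism. The implementation uses a finite state automaton so that a single CPU thread controls the input, execution, and output of multiple tasks and overlaps the three in an orderly way. Experiments show that this method greatly alleviates the shortage of bandwidth between the host and the accelerator and effectively resolves the performance fluctuation of the original GPU library. In tests of the Level 3 BLAS routine DGEMM across different problem sizes, the average performance improvement reaches 7.61%.

4. An efficient LINPACK program (Hybrid-LINPACK) is designed and implemented on petascale CPU-GPU heterogeneous systems. A heterogeneous BLAS library that uses the computing power of the CPU and the GPU at the same time is designed and implemented first. On top of this library, Hybrid-LINPACK is implemented and optimized using the MPI/OpenMP/Streaming hybrid programming model together with the high-performance LINPACK implementation for homogeneous systems (HPL 2.0). The optimizations mainly cover CPU-GPU task partitioning, CPU-GPU communication, parallelization of the SWAP algorithm, inter-node data transfer, and the traditional HPL optimization methods and parameter tuning. Hybrid-LINPACK makes full use of the computing and communication capability provided by the hardware and the architecture design: on a single compute element of Tianhe-1 it achieves a 3.3x speedup over the LINPACK implementation released by AMD, with a computational efficiency of 70.1%. The final full-system LINPACK runs measured 0.563 PFLOPS on Tianhe-1 and 2.566 PFLOPS on Tianhe-1A. These results placed Tianhe-1 fifth on the TOP500 list in November 2009 and Tianhe-1A first in November 2010, each the best TOP500 ranking achieved by a Chinese supercomputer up to that time.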
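The measurement-based adaptive task partitioning described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis implementation: the `Device` class, its `peak` and `sustained` fields, and the specific update rule (new GPU share = measured GPU rate divided by the sum of the measured rates) are hypothetical choices made for illustration, and execution time is simulated rather than measured on real hardware.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device model used only for this sketch."""
    peak: float       # theoretical peak, in work units per second
    sustained: float  # rate actually achieved; unknown to the scheduler

    def run(self, work: float) -> float:
        """Return the simulated time needed to process `work` units."""
        return work / self.sustained if work > 0 else 0.0


def adaptive_partition(tasks, cpu: Device, gpu: Device):
    """Split each task between CPU and GPU and refine the split from measurements.

    Returns a list of (gpu_share_used, elapsed_time) pairs, one per task.
    """
    queue = deque(tasks)
    # Initial ratio comes from the theoretical peaks, as the abstract states.
    gpu_share = gpu.peak / (cpu.peak + gpu.peak)
    history = []
    while queue:
        work = queue.popleft()
        gpu_work = work * gpu_share
        cpu_work = work - gpu_work
        # Both parts run concurrently; the task ends when the slower part ends.
        t_gpu = gpu.run(gpu_work)
        t_cpu = cpu.run(cpu_work)
        history.append((gpu_share, max(t_cpu, t_gpu)))
        # Assumed update rule: combine the measured time with the assigned
        # load to get achieved rates, then split in proportion to them.
        if t_cpu > 0 and t_gpu > 0:
            rate_gpu = gpu_work / t_gpu
            rate_cpu = cpu_work / t_cpu
            gpu_share = rate_gpu / (rate_cpu + rate_gpu)
    return history
```

For instance, with a CPU modeled as `Device(peak=100, sustained=80)` and a GPU as `Device(peak=500, sustained=240)`, the GPU share starts at 5/6 and moves to 0.75 after the first measurement, at which point both parts of a task finish at the same time. On real hardware the measured rates fluctuate, so the ratio would keep adjusting from task to task.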
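The idea of a single control thread that uses a finite state automaton to overlap data input, GPU computation, and data output can be sketched as a tick-based simulation. This is a simplified, non-nested illustration under assumptions of my own: the state names, the engine names (`copy_in`, `compute`, `copy_out`), and the `run_pipeline` function are hypothetical, each engine handles one operation at a time, and every cost must be at least one tick. The nested scheme and cost model of the thesis are not reproduced here.

```python
from enum import Enum, auto


class State(Enum):
    EMPTY = auto()
    LOADING = auto()
    LOADED = auto()
    COMPUTING = auto()
    COMPUTED = auto()
    STORING = auto()


# For each in-flight state: (engine performing it, state reached on completion).
IN_FLIGHT = {
    State.LOADING: ("copy_in", State.LOADED),
    State.COMPUTING: ("compute", State.COMPUTED),
    State.STORING: ("copy_out", State.EMPTY),
}
# For each waiting state: the in-flight state it may advance to.
NEXT = {
    State.COMPUTED: State.STORING,
    State.LOADED: State.COMPUTING,
    State.EMPTY: State.LOADING,
}


def run_pipeline(num_tasks, cost, num_buffers=2):
    """Simulate one control loop driving `num_tasks` tasks through the buffers.

    `cost` maps each engine name to the ticks one operation takes (>= 1).
    Returns the number of ticks until every task's output has been stored.
    """
    state = [State.EMPTY] * num_buffers
    remaining = [0] * num_buffers  # ticks left for each buffer's operation
    to_load, stored, ticks = num_tasks, 0, 0
    while stored < num_tasks:
        # 1. Start operations on idle engines, draining later stages first.
        busy = {IN_FLIGHT[s][0] for s in state if s in IN_FLIGHT}
        for waiting in (State.COMPUTED, State.LOADED, State.EMPTY):
            target = NEXT[waiting]
            engine = IN_FLIGHT[target][0]
            if engine in busy or (waiting is State.EMPTY and to_load == 0):
                continue
            for b in range(num_buffers):
                if state[b] is waiting:
                    state[b], remaining[b] = target, cost[engine]
                    busy.add(engine)
                    if waiting is State.EMPTY:
                        to_load -= 1
                    break
        # 2. Advance one tick and apply the transitions of finished operations.
        ticks += 1
        for b in range(num_buffers):
            if state[b] in IN_FLIGHT:
                remaining[b] -= 1
                if remaining[b] == 0:
                    if state[b] is State.STORING:
                        stored += 1
                    state[b] = IN_FLIGHT[state[b]][1]
    return ticks
```

With four tasks and a cost of one tick per stage, a single buffer forces the three stages to run back to back and takes 12 ticks, while two buffers let the control loop overlap them and finish in 7. In a real CPU-GPU program the "engines" would correspond to asynchronous transfers and kernel launches polled by the host thread, which this simulation does not model.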
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: TP338

Article ID: 2480800

Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2480800.html
