
Research on Programming Models and Key Performance Optimization Techniques for Petascale CPU-GPU Heterogeneous Systems

Published: 2019-05-19 14:27
【Abstract】: The ever-growing computational demands of scientific computing have driven high-performance computer systems into the petascale era, and the key technologies developed for petascale systems will be the cornerstone of future exascale systems. Constrained by CMOS feature size, power consumption, and heat dissipation, homogeneous systems that rely entirely on CPUs for computing power are hard to scale further once they reach petascale. Heterogeneous systems that use GPUs as accelerators offer better performance per watt than homogeneous systems and are one of the most promising technical routes to exascale. In November 2010, Tianhe-1A, built by the School of Computer Science of the National University of Defense Technology for the Tianjin supercomputing center, used NVIDIA Fermi GPUs and ranked first in the world with a sustained speed of 2.566 PFLOPS. Such CPU-GPU heterogeneous systems provide powerful computing capability, but programming and performance optimization on them differ from traditional homogeneous computers and become the key to exploiting the performance of the whole system. To address the difficulty of programming and optimizing applications on large-scale heterogeneous systems, this thesis takes a petascale CPU-GPU heterogeneous system as its platform and studies programming models and optimization methods for heterogeneous systems. The main contributions are:

1. The MPI/OpenMP/Streaming hybrid programming model is introduced for the first time on a petascale CPU-GPU heterogeneous computer system and extended to full system scale. For the problem of mapping software tasks to hardware resources in the hybrid programming model, three mappings are proposed: node-centric, CPU-centric, and GPU-centric task mapping. Seven requirements for intra-node programming models on large-scale parallel systems are summarized and used to evaluate current programming models: ease of use, performance scalability, memory scalability, model hierarchy, scheduling flexibility, model heterogeneity, and localization accuracy. In addition, a shared-memory-based method that lets multiple processes share a GPU is proposed, and an efficient [text truncated in source]

2. Measurement-based adaptive task partitioning. We place all tasks in a task queue and fetch tasks from it in a loop. Each fetched task is split, according to the current "task split ratio", into a part executed on the CPU and a part executed on the accelerator; the initial ratio is derived from the theoretical peak performance of the CPU and the accelerator. After the split, the task runs on the heterogeneous platform and its actual performance is measured on completion. The measured performance is combined with the workload of that split to update the task split ratio, which serves as the basis for the next split. Because the ratio is adaptively adjusted after every split and execution, the distribution of work between host and accelerator achieves good load balance, which substantially improves the computational efficiency of the heterogeneous system.

3. Nested double-buffered software pipelining based on a finite-state automaton. The execution of a GPU program consists of three parts: data input, GPU computation, and data output. We analyze the execution model and the cost model of software pipelining on heterogeneous systems and design a nested double-buffered software pipelining mechanism. In the implementation, we use a finite-state-automaton-based method in which a single CPU thread controls the input, execution, and output of multiple tasks and overlaps the three in an orderly way. Experiments show that this method greatly alleviates the problem of insufficient bandwidth between host and accelerator and effectively resolves the performance fluctuation of the original GPU library. In tests of the BLAS3 routine DGEMM at different problem sizes, the average performance improvement is 7.61%.

4. An efficient LINPACK program (Hybrid-LINPACK) is designed and implemented on petascale CPU-GPU heterogeneous systems. First, a heterogeneous BLAS library that uses the computing power of CPU and GPU at the same time is designed and implemented. Then, based on this library and the MPI/OpenMP/Streaming hybrid programming model, and building on the high-performance LINPACK implementation for homogeneous systems (HPL 2.0), Hybrid-LINPACK is implemented and optimized. The optimizations mainly cover CPU-GPU task partitioning, CPU-GPU communication, parallelization of the SWAP algorithm, inter-node data transfer, and the traditional HPL optimization methods and parameter tuning. Hybrid-LINPACK fully exploits the computing and communication capabilities provided by the hardware and architecture design: on a single compute unit of Tianhe-1 it achieves a 3.3x speedup over the LINPACK implementation released by AMD and reaches 70.1% computational efficiency. The final full-system LINPACK tests achieved measured performance of 0.563 PFLOPS on Tianhe-1 and 2.566 PFLOPS on Tianhe-1A. As a result, Tianhe-1 ranked fifth on the TOP500 list in November 2009 and Tianhe-1A ranked first in November 2010, each setting a new best TOP500 ranking for a Chinese supercomputer.
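The measurement-based adaptive task partitioning in contribution 2 can be illustrated with a minimal Python sketch. The abstract does not give the exact update formula, so the throughput-proportional update below is an assumption, as are all names (`initial_ratio`, `update_ratio`, `run_queue`) and the `run_cpu` / `run_gpu` callables, which stand in for executing a share of work and returning its measured time.

```python
from collections import deque


def initial_ratio(cpu_peak, gpu_peak):
    """GPU share of each task, derived from theoretical peak performance."""
    return gpu_peak / (cpu_peak + gpu_peak)


def update_ratio(gpu_work, gpu_time, cpu_work, cpu_time):
    """Assumed update rule: GPU share proportional to measured throughput."""
    gpu_rate = gpu_work / gpu_time
    cpu_rate = cpu_work / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


def run_queue(tasks, cpu_peak, gpu_peak, run_cpu, run_gpu):
    """Process a queue of task sizes, re-tuning the split after every task.

    run_cpu / run_gpu are hypothetical callables that execute a share of
    work and return the measured elapsed time. Here they are called one
    after the other; on a real system both parts would run concurrently,
    so the step time is modelled as the slower of the two.
    """
    ratio = initial_ratio(cpu_peak, gpu_peak)
    history = []  # (ratio used, step time) for each task
    queue = deque(tasks)
    while queue:
        work = queue.popleft()
        gpu_work = work * ratio
        cpu_work = work - gpu_work
        gpu_time = run_gpu(gpu_work)
        cpu_time = run_cpu(cpu_work)
        history.append((ratio, max(cpu_time, gpu_time)))
        # Keep the old ratio if a measurement is unusable (zero time).
        if gpu_time > 0 and cpu_time > 0:
            ratio = update_ratio(gpu_work, gpu_time, cpu_work, cpu_time)
    return ratio, history
```

With simulated devices whose sustained rates differ from their theoretical peaks, for example `run_cpu=lambda w: w / 80.0` and `run_gpu=lambda w: w / 240.0` against peaks of 100 and 500, the ratio starts at 5/6 and moves to the balanced value 0.75 after the first measurement, so later tasks finish sooner than the first one.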
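The idea in contribution 3, a single control thread that steps every task through input, computation, and output with a state machine, can be sketched as a toy simulation. This is not the thesis's nested double-buffering scheme: it is a simplified model that assumes every stage takes one tick, that each stage handles one task at a time, and that the number of tasks in flight is bounded by `num_buffers`. The names `Stage`, `NEXT`, and `run_pipeline` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    WAITING = 0
    INPUT = 1    # host-to-device transfer
    COMPUTE = 2  # GPU computation
    OUTPUT = 3   # device-to-host transfer
    DONE = 4


# Transition table of the automaton: each active stage leads to the next.
NEXT = {
    Stage.INPUT: Stage.COMPUTE,
    Stage.COMPUTE: Stage.OUTPUT,
    Stage.OUTPUT: Stage.DONE,
}


def run_pipeline(num_tasks, num_buffers):
    """Return the number of ticks a single control loop needs for all tasks."""
    if num_buffers < 1:
        raise ValueError("at least one buffer is required")
    state = [Stage.WAITING] * num_tasks
    ticks = 0
    while True:
        # One controller step: every in-flight task advances one stage.
        for i, s in enumerate(state):
            if s in NEXT:
                state[i] = NEXT[s]
        # Admit the oldest waiting task if a buffer is free. At most one
        # task is admitted per tick because there is one input channel.
        in_flight = sum(s in NEXT for s in state)
        if in_flight < num_buffers and Stage.WAITING in state:
            state[state.index(Stage.WAITING)] = Stage.INPUT
        if all(s is Stage.DONE for s in state):
            return ticks
        ticks += 1
```

In this model one buffer forces the three stages to run back to back, while additional buffers let the transfer of one task overlap with the computation of another, which is the effect the software pipeline is designed to achieve.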
【Degree-granting institution】: National University of Defense Technology
【Degree level】: Doctoral
【Year of degree】: 2013
【Classification number】: TP338


Article ID: 2480800



Link to this article: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2480800.html

