
Research on Dynamic Instruction Mapping Algorithms for the EDGE Architecture

Published: 2018-07-24 13:38

[Abstract]: The centralized structures that pervade out-of-order superscalar processors have become a serious limit on microprocessor performance. EDGE (Explicit Data Graph Execution) is one of the models proposed to address this bottleneck: it removes from the architecture the power-hungry, hard-to-scale centralized structures found in superscalar designs. In a distributed EDGE architecture, instructions are mapped onto multiple tiles and execute concurrently. Passing operands between tiles incurs latency, which degrades performance. An instruction mapping algorithm tries to offset the loss introduced by tiling by carefully balancing program parallelism against inter-tile communication latency.

The TRIPS microprocessor distributes its critical resources asymmetrically in the topology and uses a static instruction mapping algorithm (SPDI, Static Placement Dynamic Issue). This leads to considerable load imbalance across the ETs (Execute Tiles) and to communication hot spots in the operand network, which lowers IPC. This thesis implements a TRIPS-like EDGE architecture in the M5-EDGE simulator and uses it to study the dynamic instruction mapping algorithm Deep. Without compiler scheduling, Deep with round-robin mapping achieves 85% and 98.3% of the IPC of SPDI at issue widths of 1 and 2, respectively.

Taking into account the topological positions of the RTs (Register Tiles) and DTs (Data-cache Tiles), three optimizations are applied to Deep mapping: selecting ETs in order of ET number, in zigzag order, or by the sum of the block's global communication hops. At an issue width of 1, the three optimizations reduce the average hop count by 2.63%, 2.18%, and 4.70% relative to basic Deep, and raise IPC by 1.07%, 1.21%, and 2.11%, respectively. This indicates that optimizing the inter-instruction communication hop count under Deep mapping can significantly improve IPC.

Under Deep mapping, more than 90% of operands are delivered through the operand bypass, which greatly reduces the load on the operand network. When the bypass width is twice the issue width, local operand delivery latency drops to nearly zero; increasing the local bypass width effectively reduces operand delivery latency. Assigning RTs to ETs by number raises the IPC of basic Deep mapping by 1.77%. Two optimizations target the DT positions: preferring ETs close to the DTs, and selecting ETs by the sum of the block's communication hops. They raise IPC over basic Deep mapping by 1.17% and 1.89%, respectively.

Finally, the RTs and DTs are tiled into the ETs to form a 4x4 topology. In this structure, Deep mapping achieves 97.18% and 113.42% of the IPC of SPDI at issue widths of 1 and 2, respectively; when ETs are selected by computed hop count, the ratios are 97.32% and 114.06%. When microarchitectural changes shorten topological distances, or when the Deep mapping algorithm optimizes the communication hop count, system IPC improves significantly.
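The trade-off between round-robin placement and hop-aware placement described in the abstract can be illustrated with a small sketch. It is not the thesis's Deep algorithm, whose details the abstract does not give. It assumes a 4x4 mesh of execute tiles, Manhattan distance as the hop count, and a fixed number of instruction slots per tile; the function names (`round_robin_map`, `hop_aware_map`, `total_hops`) and the tie-breaking rule are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col) of a tile on the mesh


def hops(a: Coord, b: Coord) -> int:
    """Hop count between two tiles, assumed to be the Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def round_robin_map(instrs: List[str], tiles: List[Coord]) -> Dict[str, Coord]:
    """Baseline: hand instructions to tiles cyclically, ignoring communication."""
    return {ins: tile for ins, tile in zip(instrs, cycle(tiles))}


def hop_aware_map(instrs: List[str], producers: Dict[str, List[str]],
                  tiles: List[Coord], capacity: int) -> Dict[str, Coord]:
    """Greedy sketch: put each instruction on the tile with a free slot that
    minimizes the summed hops from its already-placed producers.
    Ties go to the less loaded tile, then to the earlier tile in `tiles`.
    `instrs` is assumed to be listed in dataflow (topological) order."""
    placement: Dict[str, Coord] = {}
    load = {t: 0 for t in tiles}
    for ins in instrs:
        srcs = [placement[p] for p in producers.get(ins, []) if p in placement]
        candidates = [t for t in tiles if load[t] < capacity]
        best = min(candidates,
                   key=lambda t: (sum(hops(t, s) for s in srcs), load[t]))
        placement[ins] = best
        load[best] += 1
    return placement


def total_hops(placement: Dict[str, Coord],
               producers: Dict[str, List[str]]) -> int:
    """Sum of hops over every producer-to-consumer operand transfer."""
    return sum(hops(placement[c], placement[p])
               for c, ps in producers.items() for p in ps)


if __name__ == "__main__":
    tiles = [(r, c) for r in range(4) for c in range(4)]  # 4x4 mesh
    instrs = ["i0", "i1", "i2", "i3"]
    # i1 and i2 consume i0; i3 consumes i1 and i2 (a small diamond).
    producers = {"i1": ["i0"], "i2": ["i0"], "i3": ["i1", "i2"]}
    rr = round_robin_map(instrs, tiles)
    greedy = hop_aware_map(instrs, producers, tiles, capacity=2)
    print("round-robin hops:", total_hops(rr, producers))
    print("hop-aware hops:  ", total_hops(greedy, producers))
```

In this toy diamond-shaped dependence graph, placing consumers on or next to their producers' tiles lowers the total hop count compared with cyclic placement. That is the effect the abstract attributes to hop-count optimization, although the real IPC gains depend on the simulator's timing model.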
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP332;TP301.6


Article ID: 2141553


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2141553.html
