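Design and Implementation of a Shared Architecture for Private Last-Level Caches on Many-Core Chips
Topic: many-core processing systems  Focus: on-chip memory architecture  Source: Shanghai Jiao Tong University, 2013 master's thesis  Document type: degree thesis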
[Abstract]: Rising application complexity and chip power constraints mean that single-core and multi-core processing systems increasingly struggle to meet demand, so many-core processing systems with larger numbers of processing cores are attracting growing attention. A larger core count, however, poses new challenges for the design of the on-chip memory architecture, a key factor in processing system performance. First, more cores mean a larger chip and longer on-chip memory access latency. Second, the shared data model that arises from fine-grained parallelization of applications on many-core systems increases the storage space each core needs. Traditional multi-core on-chip memory architectures all fall short against these challenges. A shared last-level Cache architecture generates heavy on-chip network traffic and is less modular and scalable with respect to individual cores. A private last-level Cache architecture gives each core a small effective storage space, which leads to excessive accesses to off-chip memory. The Cooperative Caching architecture offers a core requesting a data block few choices of where to fetch it from, which easily results in long-distance, cross-chip data block accesses.
To address these challenges and the shortcomings of the traditional architectures, this thesis proposes a shared architecture for private last-level Caches on many-core chips. It builds on the private last-level Cache architecture, which holds more promise for future many-core systems. By retaining the data blocks evicted from one core in other cores on the chip, and by allowing cores to access each other's data blocks, it makes the private last-level Cache architecture shared and raises the effective storage capacity of each core. Keeping multiple on-chip copies of an evicted data block gives a requesting core more choices, so it can obtain the block from a more suitable location. In addition, two algorithms provide fine-grained control over multi-copy retention along two dimensions, the number of copies and their placement: a retention-count decision algorithm based on a threshold that is adjusted dynamically online, and a retention-location selection algorithm based on online monitoring of storage resource utilization. Together they reduce the impact of retention on the storage space of other cores.
After describing the concrete implementation of the proposed architecture, the thesis analyzes its hardware cost: the additional hardware overhead is about 4.35% to 8.20%. Using the GEM5 full-system simulation platform, with a 64-core many-core processing system as the example, the thesis compares the proposed architecture with the traditional ones. The performance analysis shows the following. On-chip network traffic is 78.6% lower than with the shared last-level Cache architecture, slightly higher than with the private last-level Cache architecture, and 11.9% lower than with the Cooperative Caching architecture. Off-chip memory access load is 25.6% lower than with the private last-level Cache architecture and 6.5% lower than with the Cooperative Caching architecture. Overall processing performance of the many-core system improves by 59.5% on average over the shared last-level Cache architecture, by 11.9% in the best case and 6.2% on average over the private last-level Cache architecture, and by 11.2% in the best case and 5.3% on average over the Cooperative Caching architecture. Taken together, the hardware cost and performance results show that the proposed architecture effectively improves the performance of the on-chip memory architecture and of the many-core processing system as a whole. They also demonstrate the effectiveness of the proposed algorithms for controlling the retention of evicted data blocks through the number of copies and their placement.
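The abstract names two control algorithms but gives no algorithmic detail, so the following Python sketch is only an illustration of the general idea, not the thesis's method. Every specific in it is an assumption: the use of a per-block reuse count as the quantity compared against the threshold, the rule for raising or lowering the threshold, the 8x8 mesh layout for the 64 cores, the distance tie-break, and all names (`RetentionController`, `select_hosts`, `mesh_distance`).

```python
from dataclasses import dataclass
from typing import List

MESH_WIDTH = 8  # assumption: 64 cores laid out as an 8x8 mesh


@dataclass
class RetentionController:
    """Hypothetical retention-count decision with an online-adjusted threshold."""
    threshold: int = 2         # assumed minimum reuse count needed to retain a block
    max_copies: int = 2        # assumed upper bound on retained copies
    low_useful: float = 0.25   # assumed bound below which retention is deemed wasteful
    high_useful: float = 0.75  # assumed bound above which retention is deemed useful

    def copies_for(self, reuse_count: int) -> int:
        # Blocks reused less often than the threshold are not retained at all;
        # more heavily reused blocks earn more copies, up to max_copies.
        if reuse_count < self.threshold:
            return 0
        return min(self.max_copies, reuse_count // self.threshold)

    def adjust(self, retained_hits: int, retained_total: int) -> None:
        # Online feedback: raise the threshold when few retained copies are
        # ever hit, lower it (never below 1) when most of them are.
        if retained_total == 0:
            return
        useful = retained_hits / retained_total
        if useful < self.low_useful:
            self.threshold += 1
        elif useful > self.high_useful and self.threshold > 1:
            self.threshold -= 1


def mesh_distance(a: int, b: int) -> int:
    """Manhattan distance between two core IDs on the assumed mesh."""
    ax, ay = a % MESH_WIDTH, a // MESH_WIDTH
    bx, by = b % MESH_WIDTH, b // MESH_WIDTH
    return abs(ax - bx) + abs(ay - by)


def select_hosts(evicting_core: int, utilization: List[float], copies: int) -> List[int]:
    """Pick the least-utilized other cores, breaking ties by mesh distance."""
    candidates = [c for c in range(len(utilization)) if c != evicting_core]
    candidates.sort(key=lambda c: (utilization[c], mesh_distance(evicting_core, c), c))
    return candidates[:copies]
```

In this sketch, a block evicted from core 0 with a reuse count of 4 would be retained in `select_hosts(0, utilization, RetentionController().copies_for(4))`, that is, in the two other cores whose monitored utilization is lowest. Separating the count decision from the placement decision mirrors the two dimensions the abstract describes.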
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TN47; TP332