Research on Latency-Oriented Cache Data Management Mechanisms for Multi-core Processors

Published: 2018-06-05 21:34

Topic: multi-core processors + large-capacity Cache; Reference: National University of Defense Technology, 2013 doctoral dissertation

[Abstract]: Continued advances in semiconductor process technology and rapid growth in integrated-circuit design capability made multi-core processors possible and keep pushing their design toward maturity. With strong computing power, relatively low design complexity, and good scalability, multi-core processors are now widely used in commercial servers, high-performance computing, personal computers, and embedded systems, where they show a strong competitive edge. However, as the gap between multi-core computing power and off-chip memory access speed keeps widening, the "memory wall" has become a key bottleneck that severely limits multi-core performance. The on-chip Cache, the intermediate component that bridges the speed gap between processor and memory, is the best starting point for alleviating the memory wall. Organizing and fully exploiting on-chip Cache resources, and designing efficient Cache data management mechanisms, are critical to overall microprocessor performance. As on-chip Cache capacity grows, complex on-chip interconnects are adopted, and the memory access behavior of applications diversifies, large-capacity Cache design in multi-core environments faces many new and severe challenges. Traditional private or shared Cache organizations cannot effectively balance a low miss rate against a low hit latency, which seriously constrains memory system performance. This dissertation studies the memory wall problem in microprocessor design. Based on an analysis of the challenges and optimization opportunities of private, shared, and hybrid Cache organizations, it explores latency-oriented Cache data management mechanisms for multi-core processors. The main contributions are as follows.

First, to address capacity misses in multi-core private Cache organizations, the dissertation proposes CSFP, an inter-core capacity sharing mechanism based on fine-grained pseudo-partitioning. CSFP gives each Cache Bank a fine-grained array of weighted saturating counters that track and predict differences in the memory demands of the threads. The counters control the ratio between the private region and the shared region of each processor core in each Cache Set, and this ratio guides the victim-block replacement, spill, and receive decisions at each core. This inter-core capacity borrowing balances memory-demand differences between cores and thereby alleviates capacity misses in private Caches. CSFP was evaluated for a 16-core tiled architecture on Simics, a cycle-accurate architecture-level full-system simulator. The experiments show that CSFP effectively alleviates capacity misses in multi-core private Caches and shortens the execution time of multithreaded benchmarks by about 8.57% on average.

Second, to address conflict misses caused by multiple threads competing for Cache resources in a multi-core shared Cache, the dissertation proposes IMI-SM, a conflict-miss isolation mechanism based on deflection mapping. When a miss in the on-chip last-level shared Cache requires fetching data from off-chip memory, deflection mapping is triggered if evicting the LRU candidate victim block in the statically mapped target Cache Set could cause an inter-thread or intra-thread conflict miss. By introducing a dedicated conflict isolation buffer, or by using an intra-Bank vertical pressure balancing policy to widen the range of candidate target Sets during data mapping, IMI-SM allows the new block fetched from memory to be placed in the on-chip conflict isolation buffer or in another statically coupled Cache Set that is under relatively low storage pressure. This mitigates the negative effect of conflict misses on the overall on-chip hit rate of the shared Cache. The experimental results show that IMI-SM significantly reduces conflict misses when cores share Cache resources and lowers program execution time by about 7.35% on average, a sizable memory performance gain at a small hardware cost.

Third, to address long-latency hits in the distributed shared Cache of tiled multi-core processors, the dissertation proposes E-VR, an enhanced selective victim replication mechanism. On top of the original victim replication operation, E-VR adds candidate victim filtering and target set detection. When replicating a victim block, it considers not only the block's sharing pattern and read/write characteristics but also, at a fine granularity, the vertically non-uniform distribution of access pressure within the local Cache Bank. By making high-cost replications less likely and widening the range of candidate target Sets for the victim block, E-VR increases the performance benefit of replication. The experimental results show that E-VR reduces the execution time of the applications by about 6.97% on average. While lowering on-chip hit latency, E-VR avoids an excessive negative impact on the global hit rate of the shared Cache, so it trades off low hit latency against low miss rate dynamically and further improves memory system performance.

Fourth, for the virtual shared-domain partitioning organization of tiled multi-core distributed Caches, the dissertation proposes F-RMR, a unified data management framework that integrates adaptive data replacement, migration, and replication. During replacement, F-RMR is aware of the liveness and on-chip uniqueness of the candidate victim blocks in the local target Cache Set. When making migration and replication decisions across multiple virtual shared domains, it jointly considers the liveness of the hit data and the vacancy of the target Cache Set. Through cooperation among replacement, migration, and replication, the tension between long-latency hits and effective capacity utilization of the on-chip Cache is handled properly. The experimental results show that, with a shared-domain partition granularity of 4, F-RMR lowers the average memory access latency of multithreaded benchmarks by about 7.59% on average. Compared with the original virtual shared-domain partitioning mechanism, F-RMR improves performance at every shared-domain partition granularity, with negligible area overhead.
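
The counter-driven split between a private and a shared region that the abstract attributes to CSFP can be illustrated with a small sketch. The abstract does not give the counter width, the event weights, or how a counter value maps to a partition size, so everything below is an assumption chosen for illustration: the names `SaturatingCounter` and `private_ways`, the 3-bit width, the weights `MISS_WEIGHT` and `HIT_WEIGHT`, and the linear mapping are all hypothetical and are not taken from the dissertation.

```python
from dataclasses import dataclass


@dataclass
class SaturatingCounter:
    """An n-bit counter that clamps at 0 and at 2**bits - 1 instead of wrapping."""
    bits: int = 3      # assumed width, not from the dissertation
    value: int = 0

    @property
    def max_value(self) -> int:
        return (1 << self.bits) - 1

    def add(self, weight: int) -> None:
        # Positive weights raise the counter, negative weights lower it;
        # the result is clamped to the range [0, max_value].
        self.value = max(0, min(self.max_value, self.value + weight))


def private_ways(counter: SaturatingCounter, total_ways: int, min_private: int = 1) -> int:
    """Map the counter value linearly onto the number of ways kept private.

    A high counter (the local core looks short of capacity) keeps more ways
    private; a low counter leaves more ways in the shared region, where
    neighbouring cores could place spilled victim blocks.
    """
    span = total_ways - min_private
    return min_private + (span * counter.value) // counter.max_value


# Hypothetical event weights: a local miss counts more than a local hit.
MISS_WEIGHT = 2
HIT_WEIGHT = -1

if __name__ == "__main__":
    demand = SaturatingCounter(bits=3)
    for event in ["miss", "miss", "miss", "hit", "miss"]:
        demand.add(MISS_WEIGHT if event == "miss" else HIT_WEIGHT)
    # Print the demand estimate and the resulting private share of an 8-way set.
    print(demand.value, private_ways(demand, total_ways=8))
```

A saturating counter suits this kind of predictor because a long burst of misses cannot overflow it and a single hit after the burst does not erase the accumulated history. Weighting misses more heavily than hits is one plausible reading of "weighted", not a confirmed detail.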
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2013
[Classification number]: TP332

Huang Anwen (黃安文); Research on Latency-Oriented Cache Data Management Mechanisms for Multi-core Processors [D]; National University of Defense Technology; 2013


Article ID: 1983476


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1983476.html
