Research on Cache Subsystem Optimization for Chip Multi-core Processors
[Abstract]: Modern chip multi-core processors require a large-capacity cache system to bridge the performance gap between fast processor cores and slow off-chip memory. The performance and power consumption of this cache subsystem can be optimized by exploiting the characteristics of the chip multi-core processor. This thesis studies several mechanisms for optimizing the performance of the cache subsystem of chip multi-core processors. Specifically, the research covers three topics: 1) designing an efficient multicast routing algorithm to improve the performance of the on-chip network; 2) using an emerging non-volatile memory to design a low-power cache system for chip multi-core processors; and 3) exploiting thread progress information to design a more efficient cache coherence protocol. For the first topic, we propose an efficient multicast routing mechanism for the on-chip network. As the number of cores keeps growing, the on-chip network provides an efficient and scalable communication infrastructure for multi-core processors. In on-chip networks under multi-core architectures, multicast communication patterns are common; without support from an effective multicast routing mechanism, conventional unicast-based on-chip networks handle such multicast traffic inefficiently. This thesis presents a network-based multicast routing mechanism called DPM.
DPM effectively reduces the average transmission latency of packets in the network and lowers the power consumption of the on-chip network. In particular, DPM dynamically selects routes according to the current load-balance level of the network and the link-sharing characteristics of multicast communication. The second topic is to use an emerging non-volatile memory, spin-transfer torque random access memory (STT-RAM), to design a low-power cache for chip multi-core processors. STT-RAM offers fast access speed, high storage density, and negligible leakage power; however, its large-scale adoption as the cache of a multi-core processor is limited by its long write latency and high write energy. Recent studies have shown that reducing the data retention time of the STT-RAM storage cell (the magnetic tunnel junction, MTJ) can effectively improve its write performance. However, STT-RAM with reduced retention time loses data easily, so its storage cells must be refreshed periodically to avoid data loss. When such STT-RAM is used as the last-level cache (LLC) of a multi-core processor, frequent refresh operations increase energy consumption and also degrade system performance. This thesis proposes an efficient refresh scheme, CCear, that minimizes refresh operations on this class of STT-RAM. CCear eliminates unnecessary refresh operations by interacting with the cache coherence protocol and the cache management algorithm. Finally, we propose an efficient coherence-protocol adjustment mechanism to optimize the performance of parallel programs running on chip multi-core processors. One of the main goals of chip multi-core processors is to keep improving application performance by exploiting thread-level parallelism.
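The load-aware branching idea behind DPM can be illustrated with a small sketch. This is not the thesis's actual algorithm; the mesh model, the function `next_hops`, and the `link_load` table are illustrative assumptions. At each router, destinations of a multicast packet are grouped by a first hop, preferring the less-loaded admissible direction so that replication adapts to the current load balance while destinations sharing a hop share one packet copy:

```python
# Illustrative sketch only: a greedy, load-aware multicast branching step
# on a 2D mesh. next_hops and link_load are hypothetical names, not DPM's API.

def next_hops(src, dests, link_load):
    """Group multicast destinations by their chosen first hop from src.

    For each destination, the admissible X/Y direction with the lower
    current link load is chosen; destinations that pick the same hop
    share a single packet copy (link sharing).
    """
    groups = {}
    for d in dests:
        dx = (1, 0) if d[0] > src[0] else (-1, 0) if d[0] < src[0] else None
        dy = (0, 1) if d[1] > src[1] else (0, -1) if d[1] < src[1] else None
        candidates = [v for v in (dx, dy) if v is not None]
        if not candidates:
            continue  # destination is the current node
        # pick the admissible direction whose outgoing link is least loaded
        step = min(candidates, key=lambda v: link_load.get((src, v), 0))
        groups.setdefault(step, []).append(d)
    return groups
```

With an idle network, destinations east and north of the router split into two branches; when the eastward link is congested, a destination reachable both ways is steered north instead.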
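The core observation CCear exploits can also be sketched: a cache line whose coherence state says it holds no live data never needs a refresh. The class names, the MESI-style states, and the retention constants below are illustrative assumptions, not CCear's actual design:

```python
# Illustrative sketch only: refresh elimination for reduced-retention
# STT-RAM guided by coherence state. Names and constants are hypothetical.

RETENTION = 100   # cycles an MTJ cell reliably holds data after a write
URGENT = 10       # refresh lines within this many cycles of expiring

class Line:
    def __init__(self):
        self.state = 'I'    # MESI-style coherence state ('I' = invalid)
        self.expires = 0    # logical cycle at which the data decays

def refresh_pass(cache, now):
    """Refresh only lines that are both live (non-invalid coherence
    state) and close to their retention deadline; invalid lines are
    skipped entirely, saving refresh energy."""
    refreshed = []
    for idx, line in enumerate(cache):
        if line.state == 'I':
            continue                    # coherence says the data is dead
        if line.expires - now <= URGENT:
            line.expires = now + RETENTION   # rewrite the MTJ cells
            refreshed.append(idx)
    return refreshed
```

A write (or an eviction followed by a fill) would reset `expires`, so recently written lines are also skipped; the real scheme additionally coordinates with the cache management algorithm.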
However, for a multi-threaded program running on such a system, different threads typically exhibit different execution progress, due to non-uniform task assignment and contention for shared resources. This progress non-uniformity is one of the main bottlenecks limiting multi-threaded program performance: at synchronization points such as memory barriers and locks, a core running a faster thread must stall and wait for the slower cores to catch up. Such idle waiting not only degrades system performance but also wastes energy. This thesis presents a thread-progress-aware coherence adjustment mechanism called TEACA. TEACA dynamically adjusts each thread's coherence requests using the thread's progress information, with the goal of improving the utilization efficiency of on-chip network bandwidth resources. Specifically, TEACA divides threads into two classes: leader threads and laggard threads. TEACA then applies class-specific handling to each thread's coherence requests according to this classification.
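The leader/laggard idea can be sketched in a few lines. The progress metric (here, a simple comparison against the mean) and the priority policy (laggards' coherence requests served first) are illustrative assumptions standing in for TEACA's actual mechanisms:

```python
# Illustrative sketch only: progress-based thread classification and
# request prioritization. The metric and policy are hypothetical.

def classify_threads(progress):
    """Label each thread 'leader' or 'laggard' by comparing its progress
    (e.g., retired instructions) to the mean across threads."""
    avg = sum(progress.values()) / len(progress)
    return {t: ('leader' if p >= avg else 'laggard')
            for t, p in progress.items()}

def schedule_requests(requests, classes):
    """Order coherence requests so laggard threads go first, granting
    slow threads a larger share of on-chip network bandwidth.
    Each request is a (thread_id, request_type) pair."""
    return sorted(requests,
                  key=lambda r: 0 if classes[r[0]] == 'laggard' else 1)
```

Because `sorted` is stable, requests within a class keep their arrival order; only the laggard class as a whole is promoted ahead of the leaders.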
【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Doctoral (Ph.D.)
【Year conferred】: 2013
【Classification number】: TP332