Research on the Hierarchical Memory System of Multi-core Processors
Topics: multi-core processors + embedded applications; Source: Fudan University, 2012 master's thesis
[Abstract]: In recent years, handheld consumer electronics such as tablets and smartphones have developed at an unprecedented pace. As products are upgraded, ever-higher hardware configurations push power demands steadily upward. For the processor, the core component of consumer electronics, the technical requirements are shifting from high performance alone to high performance combined with low power. Meanwhile, as process scaling slows, raising the clock frequency to gain performance has proven unsustainable, and multi-core architectures, with their inherent parallelism and flexibility, have become the mainstream processor architecture. For power-sensitive and highly varied embedded applications, the inherent parallel processing capability, scalability, and low-power potential of multi-core processors make them especially suitable.
This thesis studies the hierarchical memory system of multi-core processors for embedded applications and, building on existing typical processor memory architectures, proposes a memory architecture better suited to embedded multi-core processors. The research goal is to address the high-performance and low-power requirements of embedded applications together, through an innovative hierarchical memory architecture design, so as to meet the technical requirements of embedded applications.
The contributions of this thesis can be summarized as follows:
(1) Cluster-based hierarchical memory system
This thesis proposes a hierarchical memory system based on a cluster structure. The system rebalances the weight of each level of the memory hierarchy according to the requirements of embedded applications: an extended register file increases data locality, a cache-less design (the cache is omitted) reduces the hardware overhead of the memory system, and the division into private and shared data memories further improves data locality and strengthens the hierarchy of the memory system.
(2) Extended register file design
Within the cluster-based hierarchical memory system, this thesis proposes a register file extension scheme that remains compatible with the 32-bit instruction width. It doubles the number of registers to 64, which enhances data locality and improves overall processor performance. The thesis also uses the address-mapping space provided by the extended register file to improve and optimize the message-passing inter-core communication mechanism. Verification results show that the scheme reduces the number of instructions related to inter-core communication by up to 50%, effectively improving inter-core communication efficiency.
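The address-mapping idea in (2) can be illustrated with a toy model. The sketch below is not the thesis's design: the choice of register 63 as a send port and the names `Core`, `SEND_PORT`, `write_reg`, and `explicit_send` are assumptions made only for illustration, and the instruction counts it prints are properties of the toy, not the thesis's measurements. It shows how mapping an outgoing channel into the register address space lets one register write replace a write followed by a separate send.

```python
from collections import deque

NUM_REGS = 64          # register count stated in the abstract
SEND_PORT = 63         # hypothetical: register index mapped to the outgoing channel


class Core:
    """Toy core whose register 63 is mapped to an inter-core FIFO (assumption)."""

    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.outbox = deque()      # stands in for the inter-core link
        self.issued = 0            # number of instructions issued

    def write_reg(self, index, value):
        """One instruction: write a register; a write to SEND_PORT also sends."""
        if not 0 <= index < NUM_REGS:
            raise IndexError("register index out of range")
        self.issued += 1
        if index == SEND_PORT:
            self.outbox.append(value)   # mapped register: value goes to the link
        else:
            self.regs[index] = value

    def explicit_send(self, index):
        """One instruction: send the content of an ordinary register."""
        self.issued += 1
        self.outbox.append(self.regs[index])


def send_unmapped(core, values):
    # Baseline: produce the value in a register, then issue a separate send.
    for v in values:
        core.write_reg(1, v)
        core.explicit_send(1)


def send_mapped(core, values):
    # Mapped variant: the producing instruction targets the send port directly.
    for v in values:
        core.write_reg(SEND_PORT, v)


baseline, mapped = Core(), Core()
send_unmapped(baseline, [10, 20, 30])
send_mapped(mapped, [10, 20, 30])
print(baseline.issued, mapped.issued)
```

Both cores end up with the same three values queued on the link; only the number of issued instructions differs.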
(3) Cache-less design
Within the cluster-based hierarchical memory system, this thesis adopts a cache-less design inside the processor, replacing the cache with private memory units, which saves chip area and reduces system power consumption. The thesis also proposes a direct inter-core communication strategy based on the private memory units: by specifying the packet header format, message-passing inter-core communication can proceed without the participation of the processor core, further improving both inter-core communication efficiency and the processor's computational efficiency.
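The header-driven transfer in (3) can be sketched as follows. The abstract does not give the header layout, so the three fields used here (destination core, destination word address in private memory, payload length in words), the 256-word memory size, and the names `pack_packet` and `deliver` are all hypothetical. The point is only that when the header carries the destination address, a network interface can place the payload into the receiver's private memory without the receiving core executing any instructions.

```python
import struct

HEADER = struct.Struct("<BHH")   # hypothetical layout: core id, word address, word count
WORD = struct.Struct("<I")       # 32-bit data words


def pack_packet(dest_core, dest_addr, words):
    """Build a packet: header followed by 32-bit payload words."""
    body = b"".join(WORD.pack(w) for w in words)
    return HEADER.pack(dest_core, dest_addr, len(words)) + body


def deliver(packet, private_mems):
    """Model of a network interface writing the payload straight into the
    destination core's private memory, with no core involvement."""
    dest_core, dest_addr, count = HEADER.unpack_from(packet, 0)
    mem = private_mems[dest_core]
    if dest_addr + count > len(mem):
        raise ValueError("payload does not fit in private memory")
    for i in range(count):
        (mem[dest_addr + i],) = WORD.unpack_from(packet, HEADER.size + i * WORD.size)


# Eight cores per cluster is from the abstract; 256 words per memory is an assumption.
private_mems = [[0] * 256 for _ in range(8)]
deliver(pack_packet(dest_core=3, dest_addr=16, words=[7, 8, 9]), private_mems)
print(private_mems[3][16:19])
```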
(4) Intra-cluster shared memory unit
Within the cluster-based hierarchical memory system, this thesis designs a memory unit that can be shared by all processor nodes in a cluster and, on the basis of this structure, proposes a shared-memory inter-core communication mechanism and a corresponding mailbox synchronization mechanism. Dividing memory into private and shared memory units improves data locality and mitigates processor memory-access latency.
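A mailbox-style handshake over shared memory, as mentioned in (4), might look like the sketch below. The abstract does not describe the protocol, so the single full/empty flag per mailbox and the names `Mailbox`, `post`, and `collect` are assumptions. Real hardware would also need atomic flag access, which this single-threaded model sidesteps.

```python
class Mailbox:
    """Toy mailbox: a data slot in shared memory guarded by a full/empty flag."""

    def __init__(self):
        self.full = False
        self.slot = None

    def post(self, value):
        """Producer side: write only if the previous message was collected."""
        if self.full:
            return False          # receiver has not drained the slot yet; retry later
        self.slot = value
        self.full = True          # set the flag after the data has been written
        return True

    def collect(self):
        """Consumer side: read only if a message is present, then free the slot."""
        if not self.full:
            return None
        value = self.slot
        self.full = False
        return value


box = Mailbox()
first = box.post("block-0")       # accepted
second = box.post("block-1")      # rejected: mailbox still full
got = box.collect()               # drains the first message
third = box.post("block-1")       # accepted now that the slot is free
print(first, second, got, third)
```

The flag plays the role of the synchronization point: the producer never overwrites an unread message, and the consumer never reads an empty slot.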
(5) Chip implementation and application example
A 16-core processor using this cluster-based hierarchical memory system was implemented in the TSMC 65 nm low-power CMOS process. The chip contains two clusters, each with eight processor units and one intra-cluster shared memory unit. The chip area is 9.1 mm², of which a single processor core occupies 0.43 mm², and the maximum clock frequency is 750 MHz at a 1.2 V supply voltage. On this multi-core processor, we implemented a 3780-point fast Fourier transform module to evaluate the performance gain from the hierarchical memory system and the actual power consumption. Test results show that the typical power of a single processor core is 34 mW, significantly lower than that of other multi-core processors of the same type.
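The figures in (5) can be cross-checked with a few lines of arithmetic. The sketch below only combines numbers stated in the abstract. Treating all 16 cores as drawing the typical 34 mW at the same time is an assumption, so the resulting total is not a measured chip power. The sketch also factors 3780 to show that the transform length is not a power of two, so a plain radix-2 FFT does not apply directly.

```python
CORES = 16
CORE_AREA_MM2 = 0.43
CHIP_AREA_MM2 = 9.1
CORE_POWER_MW = 34


def prime_factors(n):
    """Return the prime factorization of n as a sorted list with multiplicity."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


cores_area = CORES * CORE_AREA_MM2            # area taken by the 16 cores
core_share = cores_area / CHIP_AREA_MM2       # fraction of the die
all_cores_power = CORES * CORE_POWER_MW       # assumes every core at typical power

print(f"core area: {cores_area:.2f} mm2 ({core_share:.1%} of die)")
print(f"16 cores at typical power: {all_cores_power} mW")
print("3780 =", prime_factors(3780))
```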
[Degree-granting institution]: Fudan University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP332
Article ID: 1853343
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1853343.html