Microarchitecture Optimization of Embedded Processors
Published: 2018-05-31 04:44
Topics: embedded systems + microprocessors; Source: Zhejiang University, master's thesis, 2013
[Abstract]: Advances in process technology and the demands of emerging applications continue to drive rapid gains in processor performance. Embedded processors nevertheless face new challenges. On the one hand, the performance gap between memory and the processor increasingly limits overall system performance; on the other hand, many new applications require high-precision floating-point computation, which places new demands on processor design. Based on an analysis of application characteristics, this thesis uses data prefetching to optimize the processor's memory system and designs a floating-point unit to accelerate data processing.
Mainstream prefetching designs and configurations are not well suited to embedded processors: overly aggressive prefetching policies interfere with the processor's normal memory accesses, and complex prediction and control mechanisms cost considerable power and area. This thesis designs a variable-stride stream prefetching mechanism based on a stream information table. An optimized minimum-delta method identifies and filters data streams, which reduces circuit complexity; a prefetch buffer lowers the cache port conflict rate; and prefetched data are given a separate cache replacement policy to offset the negative effect of cache pollution on prefetching. Simulation on the NoCOP hardware simulation platform shows that, on the EEMBC and SPEC2006 benchmark suites, the stream prefetching mechanism improves performance by 4.3% on average and by up to 16% compared with no prefetching, and by 10.5% on average compared with the MSP (minimum delta prefetching) mechanism. Area increases by 35,000 equivalent gates and total power consumption by 30.1 mW.
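The stream detection idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the thesis design: it assumes that "minimum delta" means matching each new miss address to the stream table entry whose last address is nearest, and the names `StreamEntry` and `StreamPrefetcher` as well as all constants (line size, table size, maximum delta, confirmation count) are hypothetical.

```python
from dataclasses import dataclass

# All parameters below are illustrative assumptions, not values from the thesis.
LINE_SIZE = 32    # bytes per cache line
TABLE_SIZE = 4    # entries in the stream information table
MAX_DELTA = 8     # largest distance (in lines) still treated as the same stream
CONFIRM = 2       # times a stride must repeat before prefetching starts


@dataclass
class StreamEntry:
    last_line: int        # last cache line observed for this stream
    stride: int = 0       # current stride in lines (0 = not yet known)
    confidence: int = 0   # how many times the current stride has been seen


class StreamPrefetcher:
    def __init__(self):
        self.table = []   # oldest entry first

    def access(self, addr):
        """Observe one miss address; return the byte addresses to prefetch."""
        line = addr // LINE_SIZE
        # Minimum-delta match: choose the entry nearest to the new line.
        best = min(self.table, key=lambda e: abs(line - e.last_line), default=None)
        if best is None or abs(line - best.last_line) > MAX_DELTA:
            # No stream is close enough: allocate a new entry, evicting the oldest.
            if len(self.table) == TABLE_SIZE:
                self.table.pop(0)
            self.table.append(StreamEntry(last_line=line))
            return []
        delta = line - best.last_line
        if delta == 0:
            return []     # same line again, nothing to learn
        if delta == best.stride:
            best.confidence += 1
        else:
            # Variable stride: adopt the new stride and restart confirmation.
            best.stride = delta
            best.confidence = 1
        best.last_line = line
        if best.confidence >= CONFIRM:
            return [(line + best.stride) * LINE_SIZE]
        return []
```

With these assumed parameters, a miss sequence with a constant 64-byte stride starts producing prefetch addresses on the third access, and a far-away address opens a second stream instead of disturbing the first. A hardware design would additionally need the prefetch buffer and the separate replacement policy mentioned in the abstract, which this model omits.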
Most existing prefetching mechanisms cannot handle both streaming and linked data structures, and existing linked-structure prefetchers suffer from either high storage overhead or low prefetch accuracy. This thesis designs an adaptive multi-mode prefetching system that integrates a stream prefetch engine and a pointer prefetch engine. It judges the efficiency of the current operating mode from the processor's run-time information and switches among stream prefetching, pointer prefetching, and no prefetching. The FCDP (filtered content directed prefetching) pointer prefetching mechanism designed here improves the accuracy of CDP (content directed prefetching) with an offset-address-based filter, reducing the number of prefetches issued by 35% on average. Simulation on the NoCOP hardware simulation platform shows that, on the EEMBC, SPEC2006, and Olden benchmark suites, the prefetching system achieves improvements of 11.7% and 50.6% over stream prefetching alone and FCDP prefetching alone, respectively, and that it shuts down the prefetch engines promptly when prefetching is ineffective, reducing system power consumption.
To handle the large volume of floating-point data in new applications and their increasingly strict precision requirements, this thesis designs a floating-point unit suited to embedded processors to accelerate floating-point processing. It also presents a worked example of using application statistics gathered from a software simulator to guide RTL (register transfer level) design. The floating-point unit handles load/store instructions separately from floating-point arithmetic instructions, heavily reuses the logic of the original integer pipeline, and is tightly coupled to it. Experimental and logic synthesis results show that the floating-point unit supports the MIPS32 single-precision floating-point instruction set; its maximum operating frequency is 495 MHz in the worst case and 794 MHz in the typical case; area increases by 248,000 equivalent gates, and power consumption is 88.3 mW.
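The switching among stream prefetching, pointer prefetching, and no prefetching described in the abstract can be pictured as a small state machine. The sketch below is an assumption-laden illustration rather than the thesis controller: it assumes that efficiency is measured as prefetch accuracy (useful prefetches divided by issued prefetches) once per sampling interval, and the `ModeController` class, the threshold `ACCURACY_MIN`, and the back-off length `OFF_INTERVALS` are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    STREAM = "stream"
    POINTER = "pointer"
    OFF = "off"


# Illustrative assumptions, not values from the thesis.
ACCURACY_MIN = 0.4   # minimum useful/issued ratio for a mode to be kept
OFF_INTERVALS = 4    # intervals to stay in OFF before retrying


class ModeController:
    def __init__(self):
        self.mode = Mode.STREAM
        self.failed = set()   # engines that performed poorly since the last success
        self.off_left = 0     # remaining intervals in OFF mode

    def end_interval(self, issued, useful):
        """Update the mode from one interval's prefetch counts; return the new mode."""
        if self.mode is Mode.OFF:
            # Prefetching is disabled: wait out the back-off, then retry.
            self.off_left -= 1
            if self.off_left <= 0:
                self.failed.clear()
                self.mode = Mode.STREAM
            return self.mode
        accuracy = useful / issued if issued else 0.0
        if accuracy >= ACCURACY_MIN:
            self.failed.clear()   # current engine works, keep it
            return self.mode
        # Current engine is ineffective: try the other one, or turn both off.
        self.failed.add(self.mode)
        other = Mode.POINTER if self.mode is Mode.STREAM else Mode.STREAM
        if other in self.failed:
            self.mode = Mode.OFF
            self.off_left = OFF_INTERVALS
        else:
            self.mode = other
        return self.mode
```

In this model, a program phase that defeats both engines ends up in `Mode.OFF`, which corresponds to the power saving the abstract attributes to shutting down the prefetch engines; the periodic retry is one possible way to notice a later phase change.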
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP332
Article ID: 1958332
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1958332.html