Design and Implementation of a Vector Memory for a Multi-Width SIMD DSP
Published: 2018-08-04 13:46
[Abstract]: Over the past decade, advances in integrated circuit and computer technology have raised CPU performance by nearly 60% per year, while memory access latency has improved by only 7% per year [1]. The resulting "memory wall", caused by limited memory bandwidth and long access latency, has become the bottleneck to further gains in microprocessor performance. A multi-width SIMD digital signal processor (DSP) for data-intensive processing integrates multiple vector processing elements on chip, which places higher demands on parallel memory access performance. How to supply these elements with sufficient memory bandwidth, reduce extra operations such as data shuffling between them, improve the memory access efficiency of algorithms, and lower power consumption are key problems in the design of a vector memory system.
YHFT-Matrix is a high-performance DSP for soft base stations, developed independently with proprietary intellectual property by the Institute of Microelectronics and Microprocessors at the National University of Defense Technology. It uses a 10-issue very long instruction word (VLIW) architecture with multi-width SIMD. Its vector processing unit (VPU) contains 16 homogeneous vector processing elements, each with two multiply-accumulate units and other ALUs, so high data throughput and memory bandwidth are needed to exploit the VPU's full computing capability. Based on the YHFT-Matrix design requirements and the memory access characteristics of communication algorithms, this thesis designs and implements an efficient, novel, large-capacity on-chip vector memory (VM) that buffers the large volume of data the VPU operates on.
The VM has a dedicated vector address generation unit that supports linear and circular addressing. Its capacity is 1 MB, and its storage is organized as multiple banks with double buffering, interleaved on the low-order address bits. This provides parallel access for multiple vector data streams at a small cost in area and power and effectively reduces parallel access conflicts. To accelerate the relevant communication algorithms, the VM also implements a vector access rearrangement unit and a vector write-back rearrangement unit, which let it support unaligned and conditional vector accesses and give all vector processing elements in the VPU limited sharing of, and conditional access to, the VM storage space. The VM supports up to 512 Gbps of vector, 256 Gbps of DMA, and 32 Gbps of scalar data access simultaneously. A later revision also supports access to consecutive vector bytes and half-words.
YHFT-QMBase, a multi-core DSP chip built on four YHFT-Matrix cores, has been taped out successfully. Early logic verification and later chip testing show that the VM functions correctly. In a 65 nm process the chip runs at over 500 MHz, and at 700 MHz after later logic optimization. The multi-bank interleaved double-buffer structure substantially reduces access conflicts, and the limited-sharing and conditional vector access structure reduces or eliminates shuffle operations in the relevant algorithms, which shrinks code size and speeds up their execution.
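The abstract names low-order address interleaving across banks together with linear and circular addressing, but gives no parameters. The following Python sketch illustrates the general technique only, not the thesis's actual hardware. The bank count (16), the 4-byte word size, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: one bank per vector lane; not stated in the abstract
WORD_BYTES = 4   # assumption: 32-bit word granularity


def bank_and_row(byte_addr: int) -> tuple[int, int]:
    """Low-order interleaving: consecutive words land in consecutive banks."""
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


def linear_addresses(base: int, stride: int, lanes: int) -> list[int]:
    """Linear addressing: lane i accesses base + i * stride."""
    return [base + i * stride for i in range(lanes)]


def circular_addresses(base: int, offset: int, stride: int,
                       lanes: int, buf_bytes: int) -> list[int]:
    """Circular addressing: offsets wrap around a buffer of buf_bytes."""
    return [base + (offset + i * stride) % buf_bytes for i in range(lanes)]


def max_bank_load(addrs: list[int]) -> int:
    """Largest number of lanes hitting one bank (1 means conflict-free)."""
    return max(Counter(bank_and_row(a)[0] for a in addrs).values())


if __name__ == "__main__":
    # Unit-stride vector: every lane hits a different bank.
    print(max_bank_load(linear_addresses(0, 4, 16)))
    # Stride of NUM_BANKS words: every lane collides on the same bank.
    print(max_bank_load(linear_addresses(0, 64, 16)))
    # Unaligned start: still one lane per bank, but the access spans two rows.
    print(sorted({bank_and_row(a)[1] for a in linear_addresses(12, 4, 16)}))
    # Circular buffer of 64 bytes: the third lane wraps back to offset 0.
    print(circular_addresses(0, 56, 4, 4, 64))
```

Under these assumptions, an unaligned unit-stride access stays conflict-free but touches two rows of the bank array, so some reordering of the data is needed before it reaches the lanes. In the abstract, that role belongs to the vector access rearrangement unit.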
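[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP333
Article ID: 2164073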
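[References]
Related journal articles (top 2)
1 Tang Chengpei; Zhou Jian; Ni Jiangqun; Research on the implementation of a contention arbitration module for multi-port shared-access memory [J]; Journal of Electronic Measurement and Instrument; 2008, No. 02
2 Song Yang; Liu Zhenyu; Wang Dongsheng; Design and implementation of a PCI arbiter based on two-priority round-robin [J]; Microelectronics; 2004, No. 06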
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2164073.html