On-Chip Cache Resource Management Based on Hardware Performance Monitoring Support
Published: 2018-12-10 06:52
[Abstract]: Making efficient use of the on-chip cache is an important topic in multi-core processor research. Existing on-chip cache management mechanisms are transparent to software: they cannot perceive, in real time, the locality characteristics of a program's data set or the differing memory-access requests issued by multiple threads. On the one hand, when several threads run concurrently on a multi-core processor, existing cache management policies not only fail to guarantee the performance of each task, but also allow unpredictable cache contention among the tasks sharing the cache, causing mutual interference and reducing system throughput. On the other hand, because software cannot control the allocation of cache space and management is left entirely to hardware, programs use the cache inefficiently; in particular, a single-threaded program cannot exploit the abundant on-chip cache resources of a multi-core processor for speedup.

To address these problems, this thesis studies how to use the hardware performance monitoring unit (PMU) to observe a program's memory-access characteristics at run time, in order to manage shared-cache contention among concurrent threads and to allocate cache space to single-threaded programs, thereby improving the throughput and performance stability of multi-task systems and providing an efficient cache-control mechanism for single-threaded execution. The contributions of this thesis include the following:

(1) We study performance monitoring mechanisms that can sense a program's memory-access behavior in real time, and propose LWM, a low-overhead memory-performance monitoring scheme built on the performance monitoring unit. LWM provides user-level access to a program's run-time memory-performance information and supplies system-level resource-usage information to the cache manager, reducing the cost of memory-performance monitoring. In the implementation, we add performance-event members to each task structure, provide system-call interfaces for event configuration, and correct the miscounting that occurs on counter overflow and during context switches. We also optimize the time-division multiplexing of the performance counters, improving both the accuracy of multi-event monitoring and the utilization of the counters.

(2) We study contention for shared cache resources among multiple tasks, introduce the concept of memory-access load, and design a memory-load-balancing scheduling algorithm that improves multi-task system throughput and program performance stability. The algorithm follows the design of the operating system's computational load-balancing scheduler and can serve as an extension of it. Because we implement memory-load balancing as a user-level scheduling system, no changes to the operating-system kernel are required. Experimental comparison with other scheduling algorithms shows that our algorithm yields substantial improvements in weighted speedup and overall system throughput, reduces the intensity of contention for the shared cache, and decreases the total number of off-chip memory requests. Thanks to its stable behavior, memory-load-balanced scheduling also reduces the performance variation across repeated runs of a program, which can support fair and reliable task scheduling in the operating system.
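The time-division multiplexing mentioned in (1) arises when more events are monitored than there are hardware counters; perf-style tools then rotate events onto the counters and extrapolate the raw counts. A minimal sketch of that standard extrapolation (an illustration of the general technique, not of LWM's specific optimization):

```python
def scaled_count(raw, time_enabled, time_running):
    """Estimate the true event count when a performance counter was only
    scheduled onto the hardware for part of the monitoring interval.
    Uses the usual perf-style extrapolation: raw * enabled / running."""
    if time_running == 0:
        return 0  # event was never scheduled; no basis for extrapolation
    return round(raw * time_enabled / time_running)

# A counter that observed 1_000_000 events while scheduled for 25 ms
# of a 100 ms interval is extrapolated to about 4_000_000 events.
estimate = scaled_count(1_000_000, time_enabled=100, time_running=25)
```

The extrapolation assumes the event rate is uniform over the interval; improving the rotation schedule, as the thesis does, shrinks the gap between `time_running` and `time_enabled` and thus the extrapolation error.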
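The memory-load-balancing idea in (2) can be sketched as a greedy placement problem: measure each task's memory-access load (for example, last-level-cache misses per kilo-instruction) and spread the total load evenly across shared-cache domains. This is a hypothetical simplification, not the thesis's exact algorithm; the task names and load values below are illustrative:

```python
def balance_memory_load(tasks, num_domains):
    """tasks: list of (name, load) pairs, where load is a measured
    memory-access intensity (e.g. LLC misses per kilo-instruction).
    Greedily assigns each task to the currently least-loaded
    shared-cache domain, heaviest tasks first."""
    domains = {d: [] for d in range(num_domains)}
    load = [0.0] * num_domains
    for name, l in sorted(tasks, key=lambda t: t[1], reverse=True):
        d = min(range(num_domains), key=lambda i: load[i])
        domains[d].append(name)
        load[d] += l
    return domains, load

# Two memory-intensive tasks (mcf, lbm) and two compute-bound ones.
tasks = [("mcf", 30.0), ("lbm", 25.0), ("gcc", 5.0), ("perlbench", 2.0)]
domains, load = balance_memory_load(tasks, num_domains=2)
# The two heavy tasks land in different domains, so they do not
# contend for the same shared cache.
```

Like the user-level scheduler in the thesis, such a policy needs only per-task PMU readings and CPU-affinity control, with no kernel modifications.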
(3) We study the low cache-space utilization of single-threaded programs running on multi-core platforms, and propose VSCP, a new cache-control mechanism that improves their cache utilization and accelerates execution. VSCP aggregates the cache resources of the whole system and offers programmers an explicit cache-control interface: the physically distributed cache space is virtualized into a centralized cache under user control. Unlike parallelization, which maximizes the use of computing resources, VSCP aims to maximize the utilization of cache resources. Under VSCP a single-threaded program still uses only one processor core at a time, limiting the power cost of keeping multiple cores active. Moreover, when the on-chip cache cannot hold a program's entire working set, VSCP can select the data sets with strong locality to reside in the cache, ensuring they are neither evicted nor polluted, thereby lowering the cache miss rate and ultimately speeding up the program.

From this research we draw the following conclusions:

(1) Memory-access performance is critical to both individual programs and overall system performance. As the "memory wall" grows more severe, reducing the cache miss rate is more effective than reducing the number of executed instructions for improving the performance of a single program or of the system as a whole.

(2) Existing cache management policies (including operating-system task scheduling and the implementation of cache replacement policies) are unaware of inter-thread cache contention and sharing, which leads to inefficient cache management. Cache resource management must adopt thread-aware policies; otherwise it cannot support system performance, fairness, or quality-of-service goals.

(3) Cache resource management on multi-core processors ultimately requires hardware/software cooperation. This calls for redesigning the interface between the program runtime and the cache manager, including better performance monitoring infrastructure (in both software and hardware) for observing the system's run-time behavior, and fine-grained cache allocation mechanisms. Solving these problems will require joint effort from operating-system designers, hardware architects, and application developers.

The solutions proposed in this thesis were all designed and implemented on real hardware platforms, making them practical, and their implementations are general enough to serve as references for cache resource management mechanisms on future processor architectures.
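The benefit of pinning strongly local data described in (3) can be illustrated with a toy cache simulation (a hypothetical sketch, not VSCP's implementation): an LRU cache in which a small hot set can be pinned so that streaming scans cannot evict or pollute it.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache; addresses in `pinned` are never chosen as victims."""
    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)
        self.lines = OrderedDict()  # address -> True, oldest first
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # hit: refresh LRU position
            return
        self.misses += 1
        if len(self.lines) >= self.capacity:
            # Evict the least-recently-used *unpinned* line.
            for victim in self.lines:
                if victim not in self.pinned:
                    del self.lines[victim]
                    break
        self.lines[addr] = True

def run(trace, pinned=()):
    cache = LRUCache(capacity=4, pinned=pinned)
    for addr in trace:
        cache.access(addr)
    return cache.misses

# Hot data {0, 1} is reused between two long streaming scans that
# would otherwise flush it out of the cache.
trace = [0, 1] + list(range(100, 110)) + [0, 1] + list(range(200, 210)) + [0, 1]
# Without pinning the hot data misses after every scan (26 misses);
# with {0, 1} pinned, the re-accesses hit (22 misses).
```

The sketch mirrors the argument in the text: when the working set exceeds capacity, reserving space for the strongly local subset removes exactly the misses caused by eviction and pollution, at the cost of a slightly smaller cache for the streaming data.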
Article No.: 2370152
Link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2370152.html