
Design and Implementation of QR Decomposition Algorithms for Vector Processors

Published: 2018-05-06 09:08

Topic: QR decomposition + vectorization; Source: National University of Defense Technology, 2015 master's thesis

[Abstract]: QR decomposition is a core tool in digital signal processing, plays an important role in high-performance computing, and serves as an important benchmark of processor performance. It is particularly effective for solving least-squares problems, so studying QR decomposition algorithms is important for exploiting the parallel processing capability of multi-core vector processors. Given the characteristics of the Matrix vector architecture, efficient vectorized design and implementation of QR decomposition has both theoretical and practical value. This thesis analyzes in depth how to vectorize three QR decomposition algorithms, optimizes them with the fused instructions of the Matrix vector architecture, and designs and implements single-core assembly programs for large-scale data for all three: Givens rotation, Gram-Schmidt orthogonalization, and Householder transformation. The main contributions are as follows.

(1) A Givens rotation program for a single Matrix core is designed and implemented. Scalar-vector shared registers are used to reduce data transfers from DDR to SRAM. A software-pipelined implementation is designed and the program is optimized in hand-written assembly. The data layout requirements are analyzed in detail, and the initial data storage is offset to effectively reduce AM_Bsy. A double-buffered DMA data-movement strategy overlaps data transfer time with computation time, improving program performance. Experimental results show that, compared with optimized C code on the TI TMS320C6713 platform, the average speedup for double-precision Givens rotation across different matrix sizes is 74.33. For a matrix of size 2048, the computational performance reaches 74.77%.

(2) A Gram-Schmidt orthogonalization program for a single Matrix core is designed and implemented. The traditional Gram-Schmidt method is modified to better suit the structure of the Matrix vector processor. A software-pipelined implementation is designed and the program is optimized in hand-written assembly; the data layout requirements are analyzed in detail and the minimum iteration interval is determined. A double-buffered DMA data-movement strategy overlaps data transfer time with computation time, improving computational efficiency. Experimental results show that, compared with optimized C code on the TI TMS320C6713 platform, the average speedup for double-precision Gram-Schmidt orthogonalization across different matrix sizes is 83.26. For a matrix of size 2048, the computational performance reaches 46.31%.

(3) A Householder transformation program for a single Matrix core is designed and implemented. The basic principle and algorithm flow of the Householder transformation for large-scale data are analyzed in detail. Two matrix multiplication methods are compared, and the one better suited to the Matrix vector processor is selected. The evaluation of the Householder matrix is vectorized. The single-core Householder program is built around a double-buffered DMA data-movement strategy that overlaps data transfer time with computation time. Experimental results show that, compared with optimized C code on the TI TMS320C6713 platform, the average speedup for double-precision Householder transformation across different matrix sizes is 95.76. For a matrix of size 1920, the computational performance reaches 83.64%.
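To make the three methods named in the abstract concrete, here is a minimal pure-Python sketch of each on small dense matrices. It is not the thesis's hand-written Matrix assembly and does not model software pipelining, DMA double buffering, or fused instructions. All function names (`qr_givens`, `qr_mgs`, `qr_householder`, and the helpers) are illustrative assumptions, and the Gram-Schmidt routine is the textbook modified variant, which may differ from the improved scheme the thesis describes.

```python
import math

def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def qr_givens(a):
    """QR by Givens rotations: zero one subdiagonal entry per rotation."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    qt = identity(m)                       # accumulates Q^T
    for j in range(n):
        for i in range(m - 1, j, -1):      # work bottom-up within column j
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            for mat in (r, qt):            # rotate rows i-1 and i of R and Q^T
                top, bot = mat[i - 1], mat[i]
                mat[i - 1] = [c * t + s * b for t, b in zip(top, bot)]
                mat[i] = [-s * t + c * b for t, b in zip(top, bot)]
    return transpose(qt), r

def qr_mgs(a):
    """QR by modified Gram-Schmidt (assumes full column rank)."""
    n = len(a[0])
    v = transpose(a)                       # v[j] holds column j
    q_cols, r = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        for k in range(j + 1, n):          # strip the q_j component from later columns
            r[j][k] = sum(p * x for p, x in zip(qj, v[k]))
            v[k] = [x - r[j][k] * p for x, p in zip(v[k], qj)]
    return transpose(q_cols), r

def qr_householder(a):
    """QR by Householder reflections: zero a whole subcolumn per step."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    qt = identity(m)                       # accumulates Q^T
    for j in range(min(m - 1, n)):
        x = [r[i][j] for i in range(j, m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha                      # v = x - alpha * e1
        vv = sum(t * t for t in v)
        if vv == 0.0:
            continue
        for mat in (r, qt):                # apply H = I - 2 v v^T / (v^T v) to rows j..m-1
            for col in range(len(mat[0])):
                dot = sum(v[i] * mat[j + i][col] for i in range(m - j))
                f = 2.0 * dot / vv
                for i in range(m - j):
                    mat[j + i][col] -= f * v[i]
    return transpose(qt), r
```

Each function returns a pair `(q, r)` such that `matmul(q, r)` reproduces the input up to rounding. The Givens and Householder versions return a square `q`, while `qr_mgs` returns the thin factor with as many columns as the input. The inner row and column updates are the loops a vector processor would map onto its vector lanes.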
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP332


Article ID: 1851695


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1851695.html
