An Action-Weighted Actor-Critic Algorithm in Continuous Space

Published: 2018-01-29 08:29

Keywords: reinforcement learning; continuous space; function approximation; actor-critic; gradient descent; artificial intelligence. Source: Chinese Journal of Computers (《計算機學報》), 2017, Issue 06. Paper type: journal article.

[Abstract]: Classical reinforcement learning algorithms are mainly applied to discrete state and action spaces. In complex learning environments, discrete-space methods cannot adequately meet practical needs, while commonly used continuous-space methods yield optimal policies that oscillate with large amplitude. For optimal control problems in continuous space where the continuous action space is subject to interval constraints, this paper proposes an action-weighted actor-critic algorithm, Action Weight Policy Search Actor Critic (AW-PS-AC). AW-PS-AC takes the actor-critic architecture as its basic framework, approximates the optimal state value function and the optimal policy with linear function approximators, and updates one set of value-function parameters and two sets of policy parameters by gradient descent. The two sets of policy parameters are weighted to obtain the optimal policy, and the resulting optimal action is constrained to an interval to prevent it from going out of bounds. To further speed up convergence, an improved temporal difference algorithm is designed: the temporal difference error of the value function is used to update the optimal policy, and policy eligibility traces are introduced to adjust the policy parameters. The convergence of AW-PS-AC is analyzed under specified assumptions. To verify its effectiveness, AW-PS-AC is simulated in the pole-balancing and puddle world experiments. The results show that AW-PS-AC effectively finds approximately optimal policies in continuous space in both experiments and, compared with classical continuous-action-space algorithms, converges faster and is more stable.
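The abstract names the ingredients of the method but not its update equations, so the following is only a minimal sketch of how those ingredients could fit together, not the paper's AW-PS-AC rules. It shows a linear critic, two linear policy parameter sets combined by a fixed weight, a TD error that drives both critic and policy updates, eligibility traces, and clipping of the action to an interval. The class name `WeightedActorCritic`, the Gaussian exploration, the mixing weight `mix`, the two policy step sizes, and all default values are hypothetical choices made for illustration.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def clip(value, low, high):
    """Constrain an action to the interval [low, high]."""
    return max(low, min(high, value))


class WeightedActorCritic:
    """Illustrative linear actor-critic; NOT the paper's exact update rules."""

    def __init__(self, n_features, low, high, mix=0.5, sigma=0.5,
                 alpha_v=0.1, alpha_pi=(0.01, 0.001), gamma=0.95, lam=0.8):
        self.low, self.high = low, high
        self.mix, self.sigma = mix, sigma          # assumed mixing weight and noise scale
        self.alpha_v, self.alpha_pi = alpha_v, alpha_pi
        self.gamma, self.lam = gamma, lam
        self.v = [0.0] * n_features                # value-function parameters
        self.theta = [[0.0] * n_features,          # first set of policy parameters
                      [0.0] * n_features]          # second set of policy parameters
        self.reset_traces()

    def reset_traces(self):
        """Clear both eligibility traces; call at the start of each episode."""
        self.e_v = [0.0] * len(self.v)             # critic trace
        self.e_pi = [0.0] * len(self.v)            # policy trace

    def mean_action(self, phi):
        """Weighted combination of the two linear policies (assumed form)."""
        return (self.mix * dot(self.theta[0], phi)
                + (1.0 - self.mix) * dot(self.theta[1], phi))

    def act(self, phi, rng):
        """Sample a Gaussian exploratory action, then clip it to the interval."""
        raw = rng.gauss(self.mean_action(phi), self.sigma)
        return clip(raw, self.low, self.high)

    def update(self, phi, action, reward, phi_next, done=False):
        """One TD(lambda) step; returns the TD error of the value function."""
        v_next = 0.0 if done else dot(self.v, phi_next)
        delta = reward + self.gamma * v_next - dot(self.v, phi)
        # Gradient of the Gaussian log-density with respect to its mean,
        # evaluated before any parameter is changed.
        grad_mean = (action - self.mean_action(phi)) / self.sigma ** 2
        decay = self.gamma * self.lam
        self.e_v = [decay * e + f for e, f in zip(self.e_v, phi)]
        self.e_pi = [decay * e + grad_mean * f for e, f in zip(self.e_pi, phi)]
        self.v = [w + self.alpha_v * delta * e for w, e in zip(self.v, self.e_v)]
        weights = (self.mix, 1.0 - self.mix)       # chain rule through the mixture
        for k in range(2):
            step = self.alpha_pi[k] * weights[k] * delta
            self.theta[k] = [t + step * e for t, e in zip(self.theta[k], self.e_pi)]
        return delta


# Usage on a single hypothetical transition with features [bias, state].
agent = WeightedActorCritic(n_features=2, low=-1.0, high=1.0)
rng = random.Random(0)
phi = [1.0, 0.5]
action = agent.act(phi, rng)
td_error = agent.update(phi, action, reward=1.0, phi_next=[1.0, 0.2])
```

In this sketch the two policy parameter sets differ only in their step sizes, which is an assumption; the abstract does not say how the paper distinguishes them or how the weighting is chosen.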
[Affiliations]: School of Computer Science and Technology, Soochow University; Collaborative Innovation Center of Novel Software Technology and Industrialization; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
[Funding]: Supported by the National Natural Science Foundation of China (61472262, 61502323, 61502329), the Natural Science Foundation of Jiangsu Province (BK2012616), the Natural Science Research Project of Jiangsu Higher Education Institutions (13KJB520020), the Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Applied Basic Research Program, Industrial Part (SYG201422, SYG201308)
[Classification number]: TP18
[Text snapshot]: ... Liu Quan, male, born in 1969, Ph.D., professor, doctoral supervisor, senior member of the China Computer Federation (CCF), main research areas ...


Article ID: 1473001


Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1473001.html
