A Nonparametric Approximate Generalized Policy Iteration Reinforcement Learning Algorithm Based on State Clustering
Published: 2019-04-18 08:59
[Abstract]: To address two problems common to current approximate policy iteration reinforcement learning algorithms, namely their heavy computational cost and the inability to construct basis functions fully automatically, this paper proposes a nonparametric approximate generalized policy iteration reinforcement learning algorithm based on state clustering (NPAGPI-SC). The algorithm collects samples through a two-stage random sampling process; computes the initial parameters of the approximator using a trial-and-error process and an estimation method that aims at complete sample coverage; adapts the approximator during learning using the delta rule and the nearest-neighbor idea; and selects the action to execute with a greedy policy. Simulation results on the balance control of a single inverted pendulum verify the effectiveness and robustness of the proposed algorithm.
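The core update loop described in the abstract can be sketched generically: states are assigned to cluster prototypes by nearest neighbor, each cluster holds a local value estimate adjusted by the delta rule, and actions are chosen greedily. The sketch below is only an illustration of that mechanism under assumed details (a 1-D state space, fixed cluster centers, two actions, and names such as `nearest_cluster` and `delta_update`); it is not the paper's actual NPAGPI-SC formulation, which also includes two-stage sampling and trial-and-error initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster prototypes over a 1-D state space; each cluster
# stores one Q estimate per action (2 actions here). All sizes are illustrative.
centers = np.linspace(-1.0, 1.0, 11)       # state-cluster prototypes
q_values = np.zeros((len(centers), 2))     # Q estimate per (cluster, action)

def nearest_cluster(state):
    """Nearest-neighbor assignment of a state to a cluster prototype."""
    return int(np.argmin(np.abs(centers - state)))

def greedy_action(state):
    """Greedy policy: the action with the highest Q at the nearest cluster."""
    return int(np.argmax(q_values[nearest_cluster(state)]))

def delta_update(state, action, target, lr=0.1):
    """Delta rule: move the local Q estimate a step toward the target."""
    k = nearest_cluster(state)
    q_values[k, action] += lr * (target - q_values[k, action])

# Toy training signal: action 0 is rewarded for states near 0.3.
for _ in range(100):
    s = 0.3 + 0.05 * rng.standard_normal()
    delta_update(s, 0, target=1.0)

print(greedy_action(0.3))  # action 0 now dominates near state 0.3
```

Because the approximator is a table over clusters rather than a fixed basis-function expansion, no basis functions need to be designed by hand, which is the property the abstract emphasizes.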
[Affiliation]: Key Laboratory of Robot and Welding Automation of Jiangxi Province, Nanchang University
[Fund]: National 863 Program project (SS2013AA041003)
[Classification Number]: TP181
Article No.: 2459917
Link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2459917.html