Research on Memory-Based Reinforcement Learning in Partially Observable Markov Decision Processes
Keywords: reinforcement learning; U-Tree algorithm; Sarsa(λ) algorithm; Q-learning algorithm; partially observable Markov decision processes. Source: Tianjin Polytechnic University, 2017 master's thesis. Document type: degree thesis.
[Abstract]: In reinforcement learning, an agent acts on its environment and receives a reward in return; different actions yield different reward values. By repeatedly reinforcing the rewards earned along the sequence of actions that reaches the goal, the agent learns a mapping from internal states to actions, that is, a decision policy. The traditional U-Tree algorithm has achieved notable success on reinforcement learning problems in partially observable Markov decision processes (POMDPs), but because its fringe nodes are grown arbitrarily, it still suffers from large tree size, high memory demand, and excessive computational complexity. This thesis improves on the original U-Tree algorithm: using the next-step observation, it partitions the instances in a leaf node that take the same action, and proposes an algorithm that expands fringe nodes from effective instances (Effective Instance U-Tree), abbreviated EIU-Tree. This greatly reduces the scale of computation and lets the agent learn faster and better. Simulation experiments on the classical 4×3 grid-world problem show that the algorithm outperforms the original U-Tree algorithm. To address the slow convergence of the U-Tree and MU-Tree algorithms, the Sarsa(λ) algorithm is used to update Q values during the agent's value iteration, yielding a Sarsa(λ)-based algorithm (Sarsa(λ) U-Tree), abbreviated SU-Tree. When the agent reaches a goal state or a penalty state, the Q values of all instances generated along that path are updated, which speeds up convergence. Simulation experiments on the 4×3 grid-world problem and the cheese maze problem show that, compared with the original U-Tree and MU-Tree algorithms, the agent finds an oscillation-free path from the start to the goal more quickly.
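For readers unfamiliar with U-Tree-style methods, the sketch below illustrates, under stated assumptions, the kind of instance record and leaf split that the EIU-Tree description implies: instances stored in one leaf that take the same action are partitioned by the observation received next, and only the resulting groups are considered as candidate fringe nodes. The names Instance and split_leaf_by_next_obs are illustrative and do not come from the thesis; the actual split test used by EIU-Tree is not reproduced here.

# Minimal sketch (not the thesis's implementation) of a U-Tree-style instance
# and of partitioning a leaf's instances that share one action by next observation.
from collections import defaultdict, namedtuple

# One recorded transition: (observation, action, reward, next observation).
Instance = namedtuple("Instance", ["obs", "action", "reward", "next_obs"])

def split_leaf_by_next_obs(leaf_instances, action):
    """Group a leaf's instances that took `action` by their next observation.

    Each non-empty group is a candidate fringe (child) node; expanding only
    where these groups differ keeps the tree smaller than growing fringe
    nodes arbitrarily.
    """
    groups = defaultdict(list)
    for inst in leaf_instances:
        if inst.action == action:
            groups[inst.next_obs].append(inst)
    return groups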
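The SU-Tree description replaces one-step value updates with Sarsa(λ), whose eligibility traces propagate the reward received at a goal or penalty state back along every state-action pair visited on the path. The following is a minimal tabular sketch of that update, assuming, purely for illustration, that the U-Tree leaf reached by the current observation history plays the role of the state; env, select_action, and the leaf indexing are hypothetical and not taken from the thesis.

# Minimal tabular Sarsa(λ) episode with accumulating eligibility traces.
from collections import defaultdict

def sarsa_lambda_episode(env, Q, select_action, alpha=0.1, gamma=0.95, lam=0.9):
    """Run one episode and update Q (indexed by (leaf, action)) in place.

    Assumes env.reset() -> leaf and env.step(action) -> (leaf, reward, done),
    and that select_action(Q, leaf) picks an action, e.g. epsilon-greedily.
    """
    E = defaultdict(float)                 # eligibility trace per (leaf, action)
    leaf = env.reset()
    action = select_action(Q, leaf)
    done = False
    while not done:
        next_leaf, reward, done = env.step(action)
        next_action = select_action(Q, next_leaf)
        # TD error for the Sarsa target r + gamma * Q(s', a').
        delta = reward + (0.0 if done else gamma * Q[(next_leaf, next_action)]) \
                - Q[(leaf, action)]
        E[(leaf, action)] += 1.0           # accumulate trace for the visited pair
        # Every pair visited earlier in the episode is updated in proportion to
        # its trace, so a terminal goal or penalty reward is credited back along
        # the whole path in one sweep.
        for key in list(E):
            Q[key] += alpha * delta * E[key]
            E[key] *= gamma * lam
        leaf, action = next_leaf, next_action
    return Q

Because each trace entry decays by γλ per step, pairs visited early on the path still receive a smaller share of the terminal reward, which matches the abstract's claim that all instances along the path are updated when the goal or penalty state is reached.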
[Degree-granting institution]: Tianjin Polytechnic University
[Degree level]: Master
[Year conferred]: 2017
[Classification number]: O225; TP18