

Research on Memory-Based Reinforcement Learning in Partially Observable Markov Decision Processes

Published: 2018-02-11 07:48

Keywords: reinforcement learning; U-Tree algorithm; Sarsa(λ) algorithm; Q-learning algorithm; partially observable Markov decision process. Source: Tianjin Polytechnic University, 2017 master's thesis. Thesis type: degree thesis.


[Abstract]: In reinforcement learning, an agent acts on its environment and receives a reward from it; different actions draw different reward values from the environment. By repeatedly reinforcing the reward values of the sequence of actions that reaches the goal, the agent learns a mapping from internal states to actions, that is, a decision policy. The traditional U-Tree algorithm has achieved notable results on reinforcement learning problems in partially observable Markov decision processes (POMDPs), but because fringe nodes are grown arbitrarily, the tree still grows very large, with heavy memory requirements and high computational complexity. Building on the original U-Tree algorithm, this thesis uses the next-step observation to partition the instances in a leaf node that perform the same action, and proposes an Effective Instance U-Tree (EIU-Tree) algorithm that expands fringe nodes from effective instances. This greatly reduces the computational scale, helping the agent learn faster and better. Simulation experiments on the classic 4×3 grid problem show that the algorithm outperforms the original U-Tree algorithm.
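The core of the EIU-Tree idea as described above is to use the next-step observation to partition the instances in a leaf that share an action, and to grow a fringe node only when the partition is informative. Below is a minimal Python sketch of that partitioning step; the `Instance` record, the function names, and the simple reward-gap effectiveness test are illustrative assumptions, not code from the thesis (classic U-Tree uses a statistical test on future discounted returns instead).

```python
from collections import defaultdict

class Instance:
    """One step of the agent's experience: observation, action, reward,
    and the next-step observation used for fringe splitting."""
    def __init__(self, obs, action, reward, next_obs):
        self.obs = obs
        self.action = action
        self.reward = reward
        self.next_obs = next_obs

def split_by_next_observation(leaf_instances, action):
    """Group the instances in one leaf that took the same action by their
    next-step observation; each group is a candidate fringe node."""
    groups = defaultdict(list)
    for inst in leaf_instances:
        if inst.action == action:
            groups[inst.next_obs].append(inst)
    return groups

def is_effective_split(groups, threshold=0.1):
    """Illustrative effectiveness test: accept the split only when the
    groups' mean immediate rewards differ noticeably, so the tree grows
    only where the extra distinction changes the agent's estimates."""
    means = [sum(i.reward for i in g) / len(g) for g in groups.values()]
    return len(means) > 1 and max(means) - min(means) > threshold
```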
To address the slow convergence of the U-Tree and MU-Tree algorithms, this thesis updates Q values with the Sarsa(λ) algorithm during the agent's value iteration, proposing a Sarsa(λ) U-Tree (SU-Tree) algorithm. When the agent reaches the goal state or a punishment state, the Q values of all instances generated along that path are updated, which improves the convergence rate. Simulation experiments on the 4×3 grid problem and the cheese maze problem show that, compared with the original U-Tree and MU-Tree algorithms, the agent finds an oscillation-free path from start to goal more quickly.
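The SU-Tree change is in the backup rule: on reaching a goal or punishment state, the Q values of all instances along the path are refreshed via Sarsa(λ). The sketch below shows a standard tabular Sarsa(λ) backup over a stored episode, assuming `Q` is a dict keyed by `(leaf_state, action)`; the function name and hyperparameter values are assumptions, and the thesis may interleave tree growth and updates differently.

```python
def sarsa_lambda_backup(Q, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Replay one finished episode, given as a list of
    (s, a, r, s2, a2) tuples ending at a goal or punishment state,
    and update Q with accumulating eligibility traces so every
    (state, action) pair on the path is credited, not just the last step."""
    E = {}  # eligibility traces, keyed like Q by (state, action)
    for (s, a, r, s2, a2) in episode:
        # TD error for the Sarsa target r + gamma * Q(s', a')
        delta = r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0)
        E[(s, a)] = E.get((s, a), 0.0) + 1.0
        for key in list(E):
            Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
            E[key] *= gamma * lam
            if E[key] < 1e-8:
                del E[key]  # prune negligible traces to bound the inner loop
    return Q
```

Calling `sarsa_lambda_backup(Q, episode)` once per finished episode propagates the terminal reward back through every visited pair in a single pass, which is the mechanism behind the faster convergence claimed for SU-Tree.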
[Degree-granting institution]: Tianjin Polytechnic University
[Degree level]: Master's
[Year of degree conferral]: 2017
[CLC number]: O225; TP18


Document ID: 1502543


Link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/1502543.html


