Research on Adaptive Reinforcement Learning Based on the Convex Polyhedra Abstract Domain

Published: 2018-12-27 14:57

[Abstract]: Table-driven algorithms are an important class of methods for solving reinforcement learning problems. Because of the curse of dimensionality, however, they cannot be applied directly to problems with continuous state spaces. There are two main ways to address the curse of dimensionality: discretization of the state space and function approximation. Compared with function approximation, table-driven methods based on discretizing the continuous state space are intuitive in principle, simple in program structure, and computationally lightweight. The key to such methods is to find a suitable discretization mechanism that balances computational cost against accuracy and ensures that numerical measures defined on the discrete abstract state space, such as the V value function and the Q value function, support reasonably accurate policy evaluation and computation of the optimal policy π* for the original reinforcement learning problem.

This paper proposes an adaptive state space discretization method based on the convex polyhedra abstract domain and implements an adaptive Q(λ) reinforcement learning algorithm on top of it: Adaptive Polyhedra Domain based Q(λ), or APDQ(λ). Convex polyhedra are a representation of abstract states that is widely used in performance evaluation of stochastic systems and in verification of numerical properties of programs. The method uses an abstraction function to map the concrete state space to an abstract state space in the polyhedra domain, which turns the computation of an optimal policy over a continuous state space into the computation of an optimal policy over a finite, tractable abstract state space. Based on the sample sets associated with the abstract states, several adaptive refinement mechanisms are designed, including BoxRefinement, LFRefinement, and MVLFRefinement. These mechanisms continuously refine the abstract state space, thereby optimizing the discretization of the concrete state space and producing a statistical reward model consistent with the sample space obtained by online sampling.

APDQ(λ) is implemented on top of the polyhedra library PPL (Parma Polyhedra Library) and the high-precision numerical library GMP (GNU Multiple Precision), and case studies are carried out. Two typical continuous-state reinforcement learning problems, Mountain Car (MC) and the acrobatic robot (Acrobot), serve as experimental subjects. The effects of the reinforcement learning parameters and of the threshold parameters that control adaptive refinement on the performance of APDQ(λ) are evaluated in detail, and the role each parameter plays in policy optimization while the abstract state space changes dynamically is examined. The experimental results show that when the discount rate γ is greater than 0.7, the algorithm achieves good overall performance: the policy improves quickly in the early stage and then converges smoothly (as shown in Figs. 6 to 13), and the algorithm adapts well to the learning rate α and to the various refinement parameters of the abstract state space. When the discount rate γ is less than 0.6, performance degrades quickly. Applying abstract interpretation to the statistical learning process is a promising approach to continuous reinforcement learning problems, and many questions deserve further study, such as sampling and value function updates based on approximate models.
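The abstraction and refinement loop described in the abstract can be illustrated with a small Python sketch. It is not the authors' APDQ(λ): it uses axis-aligned boxes (the simplest convex polyhedra) instead of PPL polyhedra, one-step Q-learning instead of Q(λ) with eligibility traces, and a hypothetical split rule (split a box along its widest dimension once the variance of its recent TD errors exceeds a threshold). The names `Box`, `AdaptiveBoxQ`, `min_samples`, and `split_threshold` are assumptions made for illustration; the abstract does not specify the actual criteria used by BoxRefinement, LFRefinement, or MVLFRefinement.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Box:
    """Abstract state: an axis-aligned box with lo[i] <= x[i] <= hi[i]."""
    lo: tuple
    hi: tuple
    q: list                                      # one Q value per action
    errors: list = field(default_factory=list)   # TD errors observed in this box

    def contains(self, x):
        return all(l <= v <= h for l, v, h in zip(self.lo, x, self.hi))


class AdaptiveBoxQ:
    """Tabular Q-learning over a box partition that is refined online.

    Illustrative only: the refinement rule is an assumption, not the
    criterion used in the paper.
    """

    def __init__(self, lo, hi, n_actions, alpha=0.1, gamma=0.9,
                 min_samples=20, split_threshold=0.5):
        self.alpha = alpha
        self.gamma = gamma
        self.min_samples = min_samples
        self.split_threshold = split_threshold
        # Start from a single abstract state covering the whole state space.
        self.boxes = [Box(tuple(lo), tuple(hi), [0.0] * n_actions)]

    def abstract(self, x):
        """Abstraction function: map a concrete state to the box containing it."""
        for b in self.boxes:
            if b.contains(x):
                return b
        raise ValueError(f"state {x} lies outside the modelled state space")

    def update(self, x, a, r, x_next, done):
        """One-step Q-learning update on the abstract state, then try to refine."""
        b = self.abstract(x)
        target = r if done else r + self.gamma * max(self.abstract(x_next).q)
        td_error = target - b.q[a]
        b.q[a] += self.alpha * td_error
        b.errors.append(td_error)
        self._maybe_refine(b)

    def _maybe_refine(self, b):
        if len(b.errors) < self.min_samples:
            return
        mean = sum(b.errors) / len(b.errors)
        var = sum((e - mean) ** 2 for e in b.errors) / len(b.errors)
        if var <= self.split_threshold:
            b.errors.clear()          # box looks homogeneous; start a new window
            return
        # Split at the midpoint of the widest dimension; children inherit Q values.
        widths = [h - l for l, h in zip(b.lo, b.hi)]
        d = widths.index(max(widths))
        mid = (b.lo[d] + b.hi[d]) / 2
        left_hi = b.hi[:d] + (mid,) + b.hi[d + 1:]
        right_lo = b.lo[:d] + (mid,) + b.lo[d + 1:]
        self.boxes.remove(b)
        self.boxes.append(Box(b.lo, left_hi, list(b.q)))
        self.boxes.append(Box(right_lo, b.hi, list(b.q)))
```

Letting the children inherit the parent's Q values means a split never discards what has been learned. The partition only becomes finer in regions where the observed TD errors suggest that a single abstract state is lumping together concrete states with different returns.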
[Author affiliations]: School of Computer Science and Technology, Soochow University; Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (Jilin University)
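[Funding]: Supported by the National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323, 61502329); the Natural Science Foundation of Jiangsu Province (BK2012616); the Natural Science Research Project of Jiangsu Higher Education Institutions (13KJB520020); the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172014K04); the Suzhou Applied Basic Research Program (SYG201422); the Provincial Key Laboratory Fund of Soochow University (KJS1524); the China Scholarship Council (201606920013); and the Natural Science Foundation of Zhejiang Province (LY16F010019).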
[Classification number]: TP181

Article ID: 2393244

Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2393244.html
