Stochastic Scheduling of Nonstandard Multi-armed Bandits

Published: 2018-03-18 09:21

Topic: optimal stopping time; Focus: admissible stopping times; Source: East China Normal University, 2016 doctoral dissertation; Type: degree thesis

[Abstract]: The main purpose of this thesis is to extend the multi-armed bandit (MAB) stochastic scheduling model with index policies so that it better fits complex real-world settings: (1) the arms have different switching restrictions; (2) the arms have different discount rates; (3) random machine breakdowns lead to incomplete information. To this end, a second purpose of the thesis is to study the optimal stopping problem with restrictions and nonparametric Bayesian methods, so that they can be applied to the nonstandard MABs above.

The optimal stopping problem is discussed at the level of sets of random variables, within restricted classes of stopping times, and general conclusions are obtained using classical probability theory. This theory covers the classical results obtained in the discrete-time, continuous-time, and semi-Markov frameworks. The development has roughly three stages. The first stage works in the framework of single-parameter sets of random variables: the concept of a class of admissible stopping times is introduced, a restricted optimal stopping model is established, and the properties of two kinds of value families and of optimal stopping times are discussed; sufficient conditions for the existence of an optimal stopping time are then constructed, followed by a discussion of the local properties and regularity of the family of value variables. The second stage extends the optimal stopping problem to classes of two-parameter admissible random variables and studies the properties of optimal double stopping times; the results extend naturally to the multi-parameter case. The third stage discusses the reachable sets from the first stage and proves that they can be decomposed into countably many stopping times.

In a continuous-time stochastic MAB model, mutually independent arms each have their own admissible stopping range, and switching is possible only within that range. The goal is to maximize the expected total discounted reward over an infinite time horizon. First, the concept of an admissible random stopping set is introduced, and a process version of the general theory of optimal stopping with stopping restrictions is established. Next, following the idea of El Karoui and Karatzas (1994), this theory is used to establish the relationship between the reward process of a single arm and its Gittins index process. Finally, the excursion method of Kaspi and Mandelbaum (1998) is used to prove the optimality of the Gittins index policy, with an argument that is also simpler than earlier proofs.

A continuous-time stochastic MAB model is then considered in which the arms have both switching requirements and varying discount rates. Two kinds of expected total discounted reward are adopted. For each, the restricted optimal stopping theory is used to derive the corresponding index definition, and the excursion method is used to prove that the index policy is optimal for one of them but not for the other.

Bayesian methods are used to transform the scheduling problem with random breakdowns into a scheduling problem with incomplete information. With the expected discounted reward as the objective function, the characteristics of optimal index policies are discussed under static and dynamic policies, with particular attention to the one-step reward rate under dynamic policies. The aim is to understand how different Bayesian frameworks affect the scheduling policy. Under static policies, the conclusions obtained with the general framework and with the parametric framework are essentially similar. For dynamic policies, two examples are analyzed to relate the one-step reward rate to the Bayesian framework, illustrating how different Bayesian structures affect scheduling.
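To make the notion of an index policy concrete, the following is a minimal sketch of the classical Gittins index for a single arm, under simplifying assumptions that are not those of the thesis: discrete time, a finite-state Markov arm, one constant discount factor, and no switching restrictions. It uses the well-known restart formulation, in which the index of state i equals (1 - beta) times the optimal value at i of a problem where the player may, at every step, either continue from the current state or restart from i. The function name `gittins_indices` and its parameters are hypothetical names chosen for illustration.

```python
def gittins_indices(P, r, beta, tol=1e-12, max_iter=100_000):
    """Classical Gittins index of each state of one finite Markov arm.

    P    : transition matrix as a list of rows (each row sums to 1)
    r    : list of one-step rewards, one per state
    beta : discount factor with 0 < beta < 1

    Illustrative only: discrete time, unrestricted stopping.
    """
    n = len(r)
    indices = []
    for i in range(n):
        # Value iteration for the "restart in state i" problem.
        V = [0.0] * n
        for _ in range(max_iter):
            # Value of continuing to play from each state j.
            cont = [
                r[j] + beta * sum(P[j][k] * V[k] for k in range(n))
                for j in range(n)
            ]
            # In every state, choose between continuing and restarting at i.
            V_new = [max(c, cont[i]) for c in cont]
            diff = max(abs(a - b) for a, b in zip(V_new, V))
            V = V_new
            if diff < tol:
                break
        indices.append((1.0 - beta) * V[i])
    return indices


if __name__ == "__main__":
    # Two-state arm: state 0 pays 0 and moves to state 1,
    # state 1 pays 1 forever.
    P = [[0.0, 1.0], [0.0, 1.0]]
    r = [0.0, 1.0]
    print(gittins_indices(P, r, beta=0.9))
```

For this two-state arm the indices can be checked by hand: state 1 pays 1 forever, so its index is 1, and from state 0 the best reward-to-time ratio is obtained by never stopping, which gives beta. An index policy then plays, at each step, the arm whose current state has the largest index. The restricted-switching and variable-discount settings treated in the thesis call for modified index definitions that this sketch does not attempt.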
[Degree-granting institution]: East China Normal University
[Degree level]: Doctorate
[Year degree conferred]: 2016
[Classification number]: O212.8;F224

[Similar literature]

Related doctoral dissertations (top 1)

1 包文清 (Bao Wenqing); Stochastic Scheduling of Nonstandard Multi-armed Bandits [D]; East China Normal University; 2016


Article ID: 1628980


Article link: http://sikaile.net/shoufeilunwen/jjglbs/1628980.html
