Associated Model of Bi-Markov Decision Processes
Abstract  Human problem solving is typically organized on two levels: a problem is first grasped as a whole, yielding a rough overall plan, which is then carried out in concrete detail. Humans are thus a good example of a multi-resolutional intelligent system, able both to generalize bottom-up across levels (the granularity of the view of the problem becomes coarser, analogous to abstraction) and to instantiate top-down (the granularity becomes finer, analogous to specialization). On this basis, a semi-Markov decision process is constructed from two Markov decision processes running respectively on two levels, the ideal space (generalization) and the actual space (instantiation); it is called the associated model of bi-Markov decision processes. An algorithm for finding the optimal policy under this associated model is then discussed. Finally, an example shows that the associated bi-Markov decision process model economizes "thought" and offers a good tradeoff between computational validity and computational feasibility.
Source  Computer Science (《计算机科学》), CSCD, Peking University core journal, 2009, No. 9, pp. 161-166 (6 pages)
Funding  Supported by the National Natural Science Foundation of China (90412014, 60803061) and the Natural Science Foundation of Jiangsu Province (BK2008293)
Keywords  Markov decision processes, Reinforcement learning, Optimal policy
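The abstract describes the two-level construction only at a high level, and the paper's concrete model and optimal-policy algorithm are not reproduced in this record. Purely as an illustrative sketch of the general idea, the Python code below pairs an upper-level ("ideal", generalized) decision process over rooms with a lower-level ("actual", instantiated) policy over cells, and learns upper-level values with SMDP Q-learning in the spirit of the options framework for semi-Markov decision processes. The toy corridor, the room/cell decomposition, and all parameters are invented for this example and are not taken from the paper.

```python
import random
from collections import defaultdict

# Minimal illustrative sketch (not the paper's actual algorithm): a corridor of
# 12 cells grouped into 3 "rooms". The upper ("ideal") level is an MDP over
# rooms whose actions are temporally extended options; the lower ("actual")
# level executes each option cell by cell. Upper-level values are learned with
# SMDP Q-learning. All names and numbers here are invented for illustration.

N_CELLS = 12
CELLS_PER_ROOM = 4
GOAL = N_CELLS - 1
ROOMS = range(N_CELLS // CELLS_PER_ROOM)

def room_of(cell):
    """Generalization: map a concrete cell to its abstract room."""
    return cell // CELLS_PER_ROOM

def lower_step(cell, target_room):
    """Instantiation: one greedy lower-level move toward the target room
    (and, inside the goal room, toward the goal cell)."""
    if room_of(cell) < target_room:
        return cell + 1
    if room_of(cell) > target_room:
        return cell - 1
    return min(cell + 1, GOAL)

def option_done(cell, target_room):
    """The option 'reach target_room' terminates on entering that room,
    except that the goal room's option runs until the goal cell itself."""
    if room_of(cell) != target_room:
        return False
    return target_room != room_of(GOAL) or cell == GOAL

def smdp_q_learning(episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Upper-level SMDP Q-learning over (room, target_room) pairs."""
    q = defaultdict(float)
    for _ in range(episodes):
        cell = 0
        while cell != GOAL:
            s = room_of(cell)
            choices = [r for r in ROOMS if r != s]
            # epsilon-greedy choice of the abstract action (target room)
            if random.random() < eps:
                a = random.choice(choices)
            else:
                a = max(choices, key=lambda r: q[(s, r)])
            # Execute the option, accumulating the discounted reward over
            # the k primitive steps it takes (the semi-Markov part).
            ret, discount = 0.0, 1.0
            while not option_done(cell, a):
                cell = lower_step(cell, a)
                reward = 1.0 if cell == GOAL else -0.01
                ret += discount * reward
                discount *= gamma
            s2 = room_of(cell)
            best_next = 0.0 if cell == GOAL else max(
                q[(s2, r)] for r in ROOMS if r != s2)
            q[(s, a)] += alpha * (ret + discount * best_next - q[(s, a)])
    return q

if __name__ == "__main__":
    q = smdp_q_learning()
    for (s, a), v in sorted(q.items()):
        print(f"room {s} -> target room {a}: {v:.3f}")
```

The semi-Markov character comes from the upper-level update, which backs up the discounted reward accumulated over however many primitive steps the chosen option takes, while the lower level supplies the step-by-step instantiation.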