
Option Algorithm Based on Continuous-Time Semi-Markov Decision Process (Cited by: 2)
Abstract: For large-scale or complex stochastic dynamic programming systems, hierarchical reinforcement learning (HRL) can exploit the systems' hierarchical structure, or introduce a hierarchical control scheme, to overcome the curse of dimensionality and the curse of modeling. HRL belongs to the family of sample-data-driven optimization methods; through spatial/temporal abstraction mechanisms it can effectively accelerate policy learning. Among HRL techniques, the Option method is one of the most representative: it decomposes the overall task of a system into multiple subtasks that are learned and executed within a clear hierarchical structure. Traditional Option algorithms, however, are built on discrete-time semi-Markov decision processes (SMDPs) and the discounted criterion, so they cannot be applied directly to continuous-time infinite tasks. Therefore, working under the continuous-time SMDP framework and its performance-potential theory, this paper combines the ideas of existing Option algorithms with the learning formulas of continuous-time SMDPs to establish a unified continuous-time Option HRL model that applies to either the average or the discounted criterion, and gives the corresponding online learning and optimization algorithm. Finally, a robotic garbage-collection system is used as a simulation example to show that the proposed HRL algorithm is effective for the optimal control of continuous-time infinite tasks, and that, compared with a continuous-time flat Q-learning algorithm based on simulated annealing, it requires less memory and achieves higher optimization accuracy and faster optimization speed.
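As a concrete illustration of the kind of update rule the abstract describes, the following is a minimal, hypothetical sketch of a tabular SMDP-style Q-learning update over options in continuous time, covering both the discounted and the average criterion. The function name option_q_update, the parameters beta (continuous-time discount rate) and eta (estimated average-reward rate), and the exact form of the targets are assumptions for illustration only; the paper's actual algorithm is derived from the performance-potential theory of continuous-time SMDPs and is not reproduced here.

import numpy as np

# Minimal sketch (assumption, not the paper's exact algorithm): one tabular
# Q-learning update applied after an option finishes in a continuous-time SMDP.
# Q is a 2-D array indexed by (state, option); beta is a continuous-time
# discount rate; eta is an estimate of the average reward rate.
def option_q_update(Q, s, o, cum_reward, tau, s_next,
                    alpha=0.1, criterion="discounted", beta=0.05, eta=0.0):
    best_next = np.max(Q[s_next])  # greedy value at the state where the option ended
    if criterion == "discounted":
        # Discount accrued over the (random) sojourn time tau of the option.
        target = cum_reward + np.exp(-beta * tau) * best_next
    else:
        # Average criterion: subtract the estimated reward rate over the sojourn time.
        target = cum_reward - eta * tau + best_next
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# Hypothetical usage: 5 states, 3 options; one update after option 1,
# started in state 0, ran for 1.5 time units, accrued reward 2.7, ended in state 3.
Q = np.zeros((5, 3))
Q = option_q_update(Q, s=0, o=1, cum_reward=2.7, tau=1.5, s_next=3)

A caller would accumulate cum_reward and the sojourn time tau while an option executes, apply one such update when the option terminates, and, under the average criterion, also update the estimate eta online.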
Source: Chinese Journal of Computers (《计算机学报》; EI, CSCD, Peking University Core Journal), 2014, No. 9: 2027-2037 (11 pages).
Funding: Supported by the National Natural Science Foundation of China (61174188, 71231004, 61374158), the National International Science and Technology Cooperation Project (2011FA10440), the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-11-0626), and the Specialized Research Fund for the Doctoral Program of Higher Education (for doctoral supervisors) (20130111110007).
Keywords: continuous-time semi-Markov decision process (CT-SMDP); hierarchical reinforcement learning (HRL); Q-learning

References (3)

Secondary references (68)

  • 1. Wei LI, Qingtai YE, Changming ZHU. Application of hierarchical reinforcement learning in engineering domain [J]. Journal of Systems Science and Systems Engineering, 2005, 14(2): 207-217. (Cited by: 3)
  • 2. Bao G, Cassandras C G, Djaferis T E, Gandhi A D, Looze D P. Elevator dispatchers for down peak traffic [R]. ECE Department Technical Report, University of Massachusetts, 1994.
  • 3. Barto A G, Mahadevan S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems: Theory and Applications, 2003, 13: 41-77.
  • 4. Bradtke S J, Duff M O. Reinforcement learning methods for continuous-time Markov decision problems [C]// Advances in Neural Information Processing Systems 7. Cambridge, MA, 1995.
  • 5. Crites R H, Barto A G. Improving elevator performance using reinforcement learning [C]// Advances in Neural Information Processing Systems 8. 1996: 1017-1023.
  • 6. Mahadevan S, Marchalleck N, Das T, Gosavi A. Self-improving factory simulation using continuous-time average-reward reinforcement learning [C]// Proceedings of the 14th International Conference on Machine Learning (ICML '97). Nashville, TN, 1997.
  • 7. Mataric M. Reinforcement learning in the multi-robot domain [J]. Autonomous Robots, 1997, 4(1): 73-83.
  • 8. Parr R. Hierarchical control and learning for Markov decision processes [D]. Ph.D. dissertation, University of California, Berkeley, CA, 1998.
  • 9. Makar R, Mahadevan S, Ghavamzadeh M. Hierarchical multi-agent reinforcement learning [C]// Proceedings of the Fifth International Conference on Autonomous Agents. 2001: 246-253.
  • 10. Sutton R S, Barto A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998.

Co-citing literature (44)

Co-cited literature (20)

  • 1. 沈晶, 顾国昌, 刘海波. Mobile robot path planning based on hierarchical reinforcement learning in unknown dynamic environments [J]. 机器人 (Robot), 2006, 28(5): 544-547. (Cited by: 15)
  • 2. Konidaris G, Barto A. Efficient skill learning using abstraction selection [C]// Proceedings of the 21st International Joint Conference on Artificial Intelligence. 2009: 1107-1112.
  • 3. Rozo L, Jimenez P, Torras C. A robot learning from demonstration framework to perform force-based manipulation tasks [J]. Intelligent Service Robotics, 2013, 6(1): 33-51.
  • 4. Prins N W, Sanchez J C, Prasad A. A confidence metric for using neurobiological feedback in actor-critic reinforcement learning based brain-machine interfaces [J]. Frontiers in Neuroscience, 2014, 8: 111.
  • 5. Jandhyala V, Fotopoulos S, MacNeill I, et al. Inference for single and multiple change-points in time series [J]. Journal of Time Series Analysis, 2013, 34(4): 423-446.
  • 6. Kress-Gazit H, Pappas G J. Automatic synthesis of robot controllers for tasks with locative prepositions [C]// 2010 IEEE International Conference on Robotics and Automation. 2010: 3215-3220.
  • 7. Gupta K, Singh H P, Biswal B, et al. Adaptive targeting of chaotic response in periodically stimulated neural systems [J]. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2006, 16(2): 360-375.
  • 8. Xuan X, Murphy K. Modeling changing dependency structure in multivariate time series [C]// Proceedings of the 24th International Conference on Machine Learning. 2007: 1055-1062.
  • 9. Vien N A, Ertel W, Chung T C. Learning via human feedback in continuous state and action spaces [J]. Applied Intelligence, 2013, 39(2): 267-278.
  • 10. Boularias A, Chaib-Draa B. Apprenticeship learning with few examples [J]. Neurocomputing, 2013, 104(3): 83-96.

Citing literature (2)

Secondary citing literature (4)
