
Option Algorithm Based on Continuous-Time Semi-Markov Decision Process (Cited by: 2)
Abstract: For large-scale or complex stochastic dynamic programming systems, hierarchical reinforcement learning (HRL) can exploit the systems' hierarchical structure, or introduce a hierarchical control scheme, to overcome the curse of dimensionality and the curse of modeling. HRL belongs to the family of sample-data-driven optimization methods; through spatial/temporal abstraction mechanisms it can effectively accelerate policy learning. Among HRL techniques, the Option method is one of the most representative: it decomposes the overall task of a system into multiple subtasks that are learned and executed within a clear hierarchical structure. Traditional Option algorithms, however, are built on discrete-time semi-Markov decision processes (SMDPs) and the discounted criterion, so they cannot be applied directly to continuous-time infinite tasks. Therefore, working under the continuous-time SMDP framework and its performance-potential theory, this paper combines the ideas of existing Option algorithms with the learning formulas of continuous-time SMDPs to establish a unified continuous-time Option HRL model that applies to either the average or the discounted criterion, and gives the corresponding online learning and optimization algorithm. Finally, a robotic garbage-collection system is used as a simulation example to show that the proposed HRL algorithm is effective for the optimal control of continuous-time infinite tasks, and that, compared with a continuous-time flat Q-learning algorithm based on simulated annealing, it requires less memory and achieves higher optimization accuracy and faster optimization speed.
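As a concrete illustration of the kind of update rule the abstract describes, the following is a minimal, hypothetical sketch of a tabular SMDP-style Q-learning update over options in continuous time, covering both the discounted and the average criterion. The function name option_q_update, the parameters beta (continuous-time discount rate) and eta (estimated average-reward rate), and the exact form of the targets are assumptions for illustration only; the paper's actual algorithm is derived from the performance-potential theory of continuous-time SMDPs and is not reproduced here.

import numpy as np

# Minimal sketch (assumption, not the paper's exact algorithm): one tabular
# Q-learning update applied after an option finishes in a continuous-time SMDP.
# Q is a 2-D array indexed by (state, option); beta is a continuous-time
# discount rate; eta is an estimate of the average reward rate.
def option_q_update(Q, s, o, cum_reward, tau, s_next,
                    alpha=0.1, criterion="discounted", beta=0.05, eta=0.0):
    best_next = np.max(Q[s_next])  # greedy value at the state where the option ended
    if criterion == "discounted":
        # Discount accrued over the (random) sojourn time tau of the option.
        target = cum_reward + np.exp(-beta * tau) * best_next
    else:
        # Average criterion: subtract the estimated reward rate over the sojourn time.
        target = cum_reward - eta * tau + best_next
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# Hypothetical usage: 5 states, 3 options; one update after option 1,
# started in state 0, ran for 1.5 time units, accrued reward 2.7, ended in state 3.
Q = np.zeros((5, 3))
Q = option_q_update(Q, s=0, o=1, cum_reward=2.7, tau=1.5, s_next=3)

A caller would accumulate cum_reward and the sojourn time tau while an option executes, apply one such update when the option terminates, and, under the average criterion, also update the estimate eta online.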
Source: Chinese Journal of Computers (《计算机学报》; EI, CSCD, Peking University Core Journal), 2014, No. 9: 2027-2037 (11 pages).
Funding: Supported by the National Natural Science Foundation of China (61174188, 71231004, 61374158), the National International Science and Technology Cooperation Project (2011FA10440), the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-11-0626), and the Specialized Research Fund for the Doctoral Program of Higher Education (for doctoral supervisors) (20130111110007).
Keywords: continuous-time semi-Markov decision process (CT-SMDP); hierarchical reinforcement learning (HRL); Q-learning

References (3)

Secondary references (68)

  • 1. Wei LI, Qingtai YE, Changming ZHU. Application of hierarchical reinforcement learning in engineering domain [J]. Journal of Systems Science and Systems Engineering, 2005, 14(2): 207-217. (Cited by: 3)
  • 2. Bao G, Cassandras C G, Djaferis T E, Gandhi A D, Looze D P. Elevator dispatchers for down peak traffic [R]. ECE Department Technical Report, University of Massachusetts, 1994.
  • 3. Barto A G, Mahadevan S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems: Theory and Applications, 2003, 13: 41-77.
  • 4. Bradtke S J, Duff M O. Reinforcement learning methods for continuous-time Markov decision problems [C]// Advances in Neural Information Processing Systems 7. Cambridge, MA, 1995.
  • 5. Crites R H, Barto A G. Improving elevator performance using reinforcement learning [C]// Advances in Neural Information Processing Systems 8. 1996: 1017-1023.
  • 6. Mahadevan S, Marchalleck N, Das T, Gosavi A. Self-improving factory simulation using continuous-time average-reward reinforcement learning [C]// Proceedings of the 14th International Conference on Machine Learning (ICML '97). Nashville, TN, 1997.
  • 7. Mataric M. Reinforcement learning in the multi-robot domain [J]. Autonomous Robots, 1997, 4(1): 73-83.
  • 8. Parr R. Hierarchical control and learning for Markov decision processes [D]. Ph.D. dissertation, University of California, Berkeley, CA, 1998.
  • 9. Makar R, Mahadevan S, Ghavamzadeh M. Hierarchical multi-agent reinforcement learning [C]// Proceedings of the Fifth International Conference on Autonomous Agents. 2001: 246-253.
  • 10. Sutton R S, Barto A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998.

Co-citing literature (44)

Co-cited literature (20)

  • 1. 沈晶, 顾国昌, 刘海波. Mobile robot path planning based on hierarchical reinforcement learning in unknown dynamic environments [J]. 机器人 (Robot), 2006, 28(5): 544-547. (Cited by: 15)
  • 2. Konidaris G, Barto A. Efficient skill learning using abstraction selection [C]// Proceedings of the 21st International Joint Conference on Artificial Intelligence. 2009: 1107-1112.
  • 3. Rozo L, Jimenez P, Torras C. A robot learning from demonstration framework to perform force-based manipulation tasks [J]. Intelligent Service Robotics, 2013, 6(1): 33-51.
  • 4. Prins N W, Sanchez J C, Prasad A. A confidence metric for using neurobiological feedback in actor-critic reinforcement learning based brain-machine interfaces [J]. Frontiers in Neuroscience, 2014, 8: 111.
  • 5. Jandhyala V, Fotopoulos S, MacNeill I, et al. Inference for single and multiple change-points in time series [J]. Journal of Time Series Analysis, 2013, 34(4): 423-446.
  • 6. Kress-Gazit H, Pappas G J. Automatic synthesis of robot controllers for tasks with locative prepositions [C]// 2010 IEEE International Conference on Robotics and Automation. 2010: 3215-3220.
  • 7. Gupta K, Singh H P, Biswal B, et al. Adaptive targeting of chaotic response in periodically stimulated neural systems [J]. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2006, 16(2): 360-375.
  • 8. Xuan X, Murphy K. Modeling changing dependency structure in multivariate time series [C]// Proceedings of the 24th International Conference on Machine Learning. 2007: 1055-1062.
  • 9. Vien N A, Ertel W, Chung T C. Learning via human feedback in continuous state and action spaces [J]. Applied Intelligence, 2013, 39(2): 267-278.
  • 10. Boularias A, Chaib-Draa B. Apprenticeship learning with few examples [J]. Neurocomputing, 2013, 104(3): 83-96.

Citing literature (2)

Secondary citing literature (4)
