Automatic Hierarchical Approach for MAXQ Based on Action Space Partition (基于动作空间划分的MAXQ自动分层方法)

Abstract: Hierarchical reinforcement learning requires the hierarchy of a Markov Decision Process (MDP) to be constructed manually, and automatic hierarchical approaches based on the state space perform poorly in environments without obvious subgoals. To address both problems, an automatic approach that constructs the hierarchy from the action space is proposed. First, the set of actions is partitioned into several disjoint subsets according to the state components that each action affects. Then, the actions available to the Agent in different states are analyzed and bottleneck actions are identified. Finally, the bottleneck actions and the execution order of actions determine the upper-lower level relationships among the action subsets, from which the hierarchy is constructed. In addition, the termination condition for subtasks in the MAXQ method is modified so that MAXQ can find the optimal policy with the hierarchy constructed by the proposed algorithm. Experimental results show that the proposed algorithm constructs the hierarchy automatically and is not disturbed by changes in the environment. Compared with Q-learning and Sarsa, the MAXQ method using this hierarchy finds the optimal policy in less time and obtains higher returns. These results verify that the proposed algorithm can effectively construct the MAXQ hierarchy automatically and makes the search for the optimal policy more efficient.

Authors: WANG Qi (王奇), QIN Jin (秦进) (College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China)

Institution: College of Computer Science and Technology, Guizhou University

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2017, Issue 5, pp. 1357-1362 (6 pages)

Funding: National Natural Science Foundation of China (61562009); Guizhou University Introduced Talents Research Project (贵大人基合字(2012)028号)

Keywords: reinforcement learning; hierarchical reinforcement learning; automatic hierarchical approach; Markov Decision Process (MDP); subtask

Classification: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
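
The first step named in the abstract, partitioning the action set into disjoint subsets by the state components each action affects, can be illustrated with a minimal sketch. The abstract does not give the paper's actual procedure, so everything below is an assumption made for illustration: the function name `partition_actions`, the rule that actions sharing any affected component fall into the same subset, and the taxi-style action and component names in the example.

```python
from collections import defaultdict


def partition_actions(affects):
    """Group actions into disjoint subsets (illustrative assumption, not the paper's code).

    `affects` maps each action name to the set of state components it can change.
    Two actions end up in the same subset when their affected components overlap,
    directly or through a chain of other actions (union-find over actions).
    """
    parent = {action: action for action in affects}

    def find(action):
        # Follow parent links to the representative, halving the path on the way.
        while parent[action] != action:
            parent[action] = parent[parent[action]]
            action = parent[action]
        return action

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    owner = {}  # state component -> first action seen that affects it
    for action, components in affects.items():
        for component in components:
            if component in owner:
                union(owner[component], action)
            else:
                owner[component] = action

    groups = defaultdict(set)
    for action in affects:
        groups[find(action)].add(action)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical taxi-style example: moves change the taxi position,
# pickup/putdown change the passenger component.
EXAMPLE_AFFECTS = {
    "north": {"taxi_pos"},
    "south": {"taxi_pos"},
    "east": {"taxi_pos"},
    "west": {"taxi_pos"},
    "pickup": {"passenger"},
    "putdown": {"passenger"},
}

print(partition_actions(EXAMPLE_AFFECTS))
# [['east', 'north', 'south', 'west'], ['pickup', 'putdown']]
```

In this sketch the resulting subsets would be the candidates for subtasks; identifying bottleneck actions and ordering the subsets into levels, the later steps in the abstract, are not covered here.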
