Abstract
Reinforcement learning suffers from low learning efficiency when state-value generalization and random exploration policies are applied to the control of deterministic MDP systems. To address this, this paper proposes a model-based hierarchical reinforcement learning algorithm with a two-layer structure. The low layer uses the system model and a greedy policy to select exploratory actions and carries out the reinforcement learning task, while the high layer analyses state regions to guide the low layer's learning and correct its erroneous actions. The high layer's guidance consists of three parts: during generalization, applying different learning factors to correctly and incorrectly judged state values within the generalized region, reducing the effect of generalization on convergence; building inference rules over state regions and using them to guide learning in unknown state regions, accelerating learning; and using the system model and the inference rules to concentrate exploration on the controllable region of the system, avoiding the full-state-space search that a random exploration policy requires. The proposed algorithm achieves preliminary control of the system in a short time, and its effectiveness is verified on the control of a double inverted pendulum.
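The two-layer scheme described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's algorithm: a one-dimensional deterministic chain stands in for the controlled system, `high_layer_rule` plays the role of the inference rules over state regions, and the learning-factor values (0.5 vs. 0.1) and exploration parameters are assumptions chosen for the example.

```python
import random

random.seed(0)

# Toy deterministic chain MDP: states 0..9, actions move left/right,
# goal is the rightmost state. The system model is known.
N = 10
GOAL = N - 1
ACTIONS = (-1, +1)

def model(s, a):
    """Known deterministic system model: successor state of (s, a)."""
    return min(max(s + a, 0), GOAL)

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def high_layer_rule(s):
    """Assumed inference rule over a state region: in the right half of
    the chain, moving right is known to be correct; elsewhere, no rule."""
    return +1 if N // 2 <= s < GOAL else None

def learning_factor(s, a):
    """Different learning factors (assumed values): trust the update more
    when the action agrees with the high layer's judgement."""
    return 0.5 if high_layer_rule(s) == a else 0.1

gamma, epsilon = 0.9, 0.2
for episode in range(200):
    s = random.randrange(N)            # explore from varied start states
    for _ in range(50):
        if s == GOAL:
            break
        a = high_layer_rule(s)         # high layer guides/corrects...
        if a is None:                  # ...otherwise the low layer acts
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = model(s, a)               # low layer queries the system model
        r = 1.0 if s2 == GOAL else 0.0
        alpha = learning_factor(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2
```

After training, the rule-covered region has positive action values, and the combined rule-plus-greedy policy drives the chain to the goal; the point of the sketch is only to show how region rules can both override actions and modulate the learning rate.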
Source
《北京交通大学学报》
EI
CAS
CSCD
Peking University Core Journals
2006, No. 5, pp. 1-5 (5 pages)
JOURNAL OF BEIJING JIAOTONG UNIVERSITY
Funding
Supported by the National Natural Science Foundation of China (60373029)
Keywords
reinforcement learning
Markov decision process (MDP)
exploration policy
inverted pendulum