Abstract
Reinforcement learning suffers from low learning efficiency when state-value generalization and random exploration policies are applied to the control of deterministic MDP systems. To address this, this paper proposes a model-based hierarchical reinforcement learning algorithm with a two-layer structure. The low layer uses the system model and a greedy policy to select exploratory actions and carries out the reinforcement learning task, while the high layer analyses state regions to guide the low layer's learning and correct its erroneous actions. The high layer's guidance consists of three parts: during generalization, applying different learning factors to correctly and incorrectly judged state values within the generalized region, reducing the effect of generalization on convergence; building inference rules over state regions and using them to guide learning in unknown state regions, accelerating learning; and using the system model and the inference rules to concentrate exploration on the controllable region of the system, avoiding the full-state-space search that a random exploration policy requires. The proposed algorithm achieves preliminary control of the system in a short time, and its effectiveness is verified on the control of a double inverted pendulum.
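The two-layer scheme described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's algorithm: a one-dimensional deterministic chain stands in for the controlled system, `high_layer_rule` plays the role of the inference rules over state regions, and the learning-factor values (0.5 vs. 0.1) and exploration parameters are assumptions chosen for the example.

```python
import random

random.seed(0)

# Toy deterministic chain MDP: states 0..9, actions move left/right,
# goal is the rightmost state. The system model is known.
N = 10
GOAL = N - 1
ACTIONS = (-1, +1)

def model(s, a):
    """Known deterministic system model: successor state of (s, a)."""
    return min(max(s + a, 0), GOAL)

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def high_layer_rule(s):
    """Assumed inference rule over a state region: in the right half of
    the chain, moving right is known to be correct; elsewhere, no rule."""
    return +1 if N // 2 <= s < GOAL else None

def learning_factor(s, a):
    """Different learning factors (assumed values): trust the update more
    when the action agrees with the high layer's judgement."""
    return 0.5 if high_layer_rule(s) == a else 0.1

gamma, epsilon = 0.9, 0.2
for episode in range(200):
    s = random.randrange(N)            # explore from varied start states
    for _ in range(50):
        if s == GOAL:
            break
        a = high_layer_rule(s)         # high layer guides/corrects...
        if a is None:                  # ...otherwise the low layer acts
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = model(s, a)               # low layer queries the system model
        r = 1.0 if s2 == GOAL else 0.0
        alpha = learning_factor(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                              - Q[(s, a)])
        s = s2
```

After training, the rule-covered region has positive action values, and the combined rule-plus-greedy policy drives the chain to the goal; the point of the sketch is only to show how region rules can both override actions and modulate the learning rate.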
Source
《北京交通大学学报》
EI
CAS
CSCD
Peking University Core Journals
2006, No. 5, pp. 1-5 (5 pages)
JOURNAL OF BEIJING JIAOTONG UNIVERSITY
Funding
Supported by the National Natural Science Foundation of China (60373029)
Keywords
reinforcement learning
Markov decision process (MDP)
exploration policy
inverted pendulum