Abstract
The eNAC (episodic Natural Actor-Critic) algorithm, an episode-based reinforcement learning control algorithm, has good theoretical learning performance, but it learns inefficiently because many fixed-length episodes must be sampled to obtain a good policy. To address this problem, a new algorithm named ER-eNAC is proposed, which introduces an episode-reuse mechanism into eNAC. In ER-eNAC, some past episodes are reused in the estimation of the current natural policy gradient so that experience information is exploited more efficiently, and the reused episodes are weighted with an exponential decay according to the number of policy updates they have undergone, which describes their fitness to the current policy. The proposed algorithm is applied to the stabilization control of an inverted pendulum. Simulation results show that, compared with eNAC, ER-eNAC significantly reduces the number of episodes that must be sampled during learning and thus remarkably improves learning efficiency.
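The episode-reuse weighting described above can be sketched as follows. This is a minimal illustration, not the paper's exact eNAC formulation: the buffer size, decay factor, and the scalar "gradient estimate" (a weighted average of episode returns standing in for the natural policy gradient estimate) are all assumptions made for the example.

```python
# Hedged sketch of exponential-decay weighting of reused episodes:
# each stored episode carries an "age" counting the policy updates it
# has undergone, and its weight decays as DECAY ** age. Constants and
# the simplified "estimate" below are illustrative assumptions only.

DECAY = 0.5        # assumed decay factor per policy update
BUFFER_SIZE = 20   # assumed number of past episodes kept for reuse

buffer = []        # each entry: [episode_return, age_in_policy_updates]

def add_episode(episode_return):
    """Store a freshly sampled episode with age 0."""
    buffer.append([episode_return, 0])
    if len(buffer) > BUFFER_SIZE:
        buffer.pop(0)  # discard the oldest episode

def policy_update():
    """Form a weighted average of episode returns (a stand-in for the
    natural policy gradient estimate), then age every stored episode."""
    weights = [DECAY ** age for _, age in buffer]
    total = sum(weights)
    estimate = sum(w * r for (r, _), w in zip(buffer, weights)) / total
    for entry in buffer:
        entry[1] += 1  # one more policy update undergone
    return estimate

# Example: two older episodes and one fresh one
add_episode(1.0)
add_episode(1.0)
policy_update()      # both stored episodes now have age 1
add_episode(4.0)     # fresh episode, age 0, full weight
est = policy_update()
# weights are 0.5, 0.5, 1.0, so the fresh episode dominates the estimate
print(est)
```

The point of the decay is visible in the example: episodes collected under older policies still contribute, but their influence shrinks geometrically with every policy update, so the estimate tracks the current policy.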
Source
Journal of South China University of Technology (Natural Science Edition)
EI
CAS
CSCD
PKU Core (北大核心)
2012, No. 6, pp. 70-75 (6 pages)
Funding
Young Scientists Fund of the National Natural Science Foundation of China (61004066)
Science and Technology Program of Zhejiang Province (2011C23106)
Keywords
reinforcement learning
natural policy gradient
experience reuse
inverted pendulum control