Journal Article

Efficient Reinforcement-Learning Control Algorithm Using Experience Reuse (Cited by: 1)
Abstract: The eNAC (episodic Natural Actor-Critic) algorithm, an episode-based reinforcement-learning control algorithm, has good learning performance in theory, but it learns inefficiently because many fixed-length episodes must be sampled to obtain a good policy. To address this problem, a new algorithm named ER-eNAC is proposed, which adds an episode-reuse mechanism to eNAC. In ER-eNAC, some previously sampled episodes are reused when estimating the current natural policy gradient so that experience is exploited more effectively; each reused episode is weighted with an exponential decay in the number of policy updates it has undergone, which reflects how well it still fits the current policy. Simulation results on inverted pendulum stabilization show that, compared with eNAC, ER-eNAC significantly reduces the number of episodes that must be sampled during learning and thus improves learning efficiency.
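To make the reuse-and-decay rule described in the abstract concrete, the sketch below keeps past fixed-length episodes in a buffer, tracks how many policy updates each episode has undergone, and down-weights older episodes exponentially when estimating a policy gradient. This is a minimal illustration, not the authors' implementation: the buffer size, the decay base, and the simplified REINFORCE-style gradient estimate (standing in for eNAC's natural-gradient least-squares step) are all assumptions.

```python
import numpy as np

DECAY = 0.8          # assumed exponential-decay base for reused episodes
BUFFER_SIZE = 20     # assumed number of past episodes kept for reuse


class EpisodeBuffer:
    """Stores fixed-length episodes together with their 'age', i.e. the
    number of policy updates they have undergone since being sampled."""

    def __init__(self):
        self.episodes = []  # each entry: [score_sum, episode_return, age]

    def add(self, score_sum, episode_return):
        # A freshly sampled episode enters with age 0 (weight 1).
        self.episodes.append([score_sum, episode_return, 0])
        if len(self.episodes) > BUFFER_SIZE:
            self.episodes.pop(0)

    def age_all(self):
        # Called once per policy update: every stored episode gets older,
        # so its weight DECAY**age shrinks exponentially.
        for ep in self.episodes:
            ep[2] += 1


def weighted_gradient_estimate(buffer, n_params):
    """Weighted episode-based gradient estimate: reused episodes contribute
    with weight DECAY**age, fresh episodes (age 0) with weight 1."""
    g = np.zeros(n_params)
    total_w = 0.0
    for score_sum, episode_return, age in buffer.episodes:
        w = DECAY ** age
        g += w * score_sum * episode_return
        total_w += w
    return g / max(total_w, 1e-12)


# Usage sketch with random stand-in data for the per-episode score sums.
rng = np.random.default_rng(0)
buf = EpisodeBuffer()
for update in range(3):                        # three policy updates
    for _ in range(5):                         # five new episodes per update
        buf.add(rng.normal(size=4), rng.normal())
    grad = weighted_gradient_estimate(buf, 4)  # mixes fresh and reused episodes
    buf.age_all()                              # reused episodes decay further next time
```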
Source: Journal of South China University of Technology (Natural Science Edition), 2012, No. 6, pp. 70-75 (6 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Young Scientists Fund of the National Natural Science Foundation of China (61004066); Zhejiang Provincial Science and Technology Plan Project (2011C23106).
Keywords: reinforcement learning; natural policy gradient; experience reuse; inverted pendulum control

References (13)

  • 1 Tamei T, Shibata T. Fast reinforcement learning for three-dimensional kinetic human-robot cooperation with EMG-to-activation model [J]. Advanced Robotics, 2011, 25(5): 563-580.
  • 2 Han Y K, Kimura H. Motions obtaining of multi-degree-freedom underwater robot by using reinforcement learning algorithms [C]// Proceedings of TENCON IEEE Region 10 Conference. Fukuoka: IEEE, 2010: 1498-1502.
  • 3 Peters J, Schaal S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7/8/9): 1180-1190.
  • 4 Abbeel P. Apprenticeship learning and reinforcement learning with application to robotic control [D]. Stanford: Department of Computer Science, Stanford University, 2008: 129-151.
  • 5 Yu Tao, Hu Xibing, Liu Jing. Multi-objective optimal power flow calculation based on the multi-step backtracking Q(λ) learning algorithm [J]. Journal of South China University of Technology (Natural Science Edition), 2010, 38(10): 139-145. (Cited by: 6)
  • 6 Chu B, Park J, Hong D. Tunnel ventilation controller design using an RLS-based natural actor-critic algorithm [J]. International Journal of Precision Engineering and Manufacturing, 2010, 11(6): 829-838.
  • 7 Peters J, Vijayakumar S, Schaal S. Reinforcement learning for humanoid robotics [C]// Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots. Karlsruhe: IEEE, 2003: 2002-2021.
  • 8 Bhatnagar S, Sutton R S, Ghavamzadeh M, et al. Natural actor-critic algorithms [J]. Automatica, 2009, 45(11): 2471-2482.
  • 9 Rosenstein M T, Barto A G. Reinforcement learning with supervision by a stable controller [C]// Proceedings of the 2004 American Control Conference. Boston: IEEE, 2004: 4517-4522.
  • 10 Lin L J. Self-improving reactive agents based on reinforcement learning, planning and teaching [J]. Machine Learning, 1992, 8(3/4): 293-321.

Secondary References (43)

  • 1 SUTTON R, BARTO A. Reinforcement learning: an introduction [M]. MIT Press, 1998.
  • 2 SINGH S P. Learning to solve Markovian decision processes [D]. University of Massachusetts, 1994.
  • 3 ROY B V. Learning and value function approximation in complex decision processes [M]. MIT Press, 1998.
  • 4 WATKINS C. Learning from delayed rewards [D]. Cambridge: University of Cambridge, 1989.
  • 5 HUMPHRYS M. Action selection methods using reinforcement learning [D]. Cambridge: University of Cambridge, 1996.
  • 6 BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming [M]. Belmont, Mass.: Athena Scientific, 1996.
  • 7 SUTTON R S, MCALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation [A]// Advances in Neural Information Processing Systems [C]. Denver, USA, 2000.
  • 8 BAIRD L C. Residual algorithms: reinforcement learning with function approximation [A]// Proceedings of the 12th International Conference on Machine Learning [C]. San Francisco, 1995.
  • 9 TSITSIKLIS J N, ROY V B. Feature-based methods for large scale dynamic programming [J]. Machine Learning, 1996(22): 59-94.
  • 10 BAXTER J, BARTLETT P L. Infinite-horizon policy-gradient estimation [J]. Journal of Artificial Intelligence Research, 2001(15): 319-350.

Co-citing Literature (11)

Co-cited Literature (20)

  • 1 SUTTON R S, BARTO A G. Introduction to reinforcement learning [M]. Cambridge: MIT, 1998.
  • 2 PETERS J, SCHAAL S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7): 1180-1190.
  • 3 GRONDMAN I, BUSONIU L, LOPES G A D, et al. A survey of actor-critic reinforcement learning: standard and natural policy gradients [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(6): 1291-1307.
  • 4 BHATNAGAR S, SUTTON R S, GHAVAMZADEH M, et al. Natural actor-critic algorithms [J]. Automatica, 2009, 45(11): 2471-2482.
  • 5 SUTTON R S. Learning to predict by the methods of temporal differences [J]. Machine Learning, 1988, 3(1): 9-44.
  • 6 ADAM S, BUSONIU L, BABUSKA R. Experience replay for real-time reinforcement learning control [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(2): 201-212.
  • 7 BRADTKE S J, BARTO A G. Linear least-squares algorithms for temporal difference learning [J]. Machine Learning, 1996, 22(1/2/3): 33-57.
  • 8 BOYAN J A. Technical update: least-squares temporal difference learning [J]. Machine Learning, 2002, 49(2/3): 233-246.
  • 9 DANN C, NEUMANN G, PETERS J. Policy evaluation with temporal differences: a survey and comparison [J]. The Journal of Machine Learning Research, 2014, 15(1): 809-883.
  • 10 GEIST M, PIETQUIN O. Revisiting natural actor-critics with value function approximation [M]// Modeling Decisions for Artificial Intelligence. Berlin: Springer, 2010: 207-218.

Citing Literature (1)

Secondary Citing Literature (3)
