Journal Article

Efficient Reinforcement-Learning Control Algorithm Using Experience Reuse (Cited by: 1)
Abstract: The eNAC (episodic Natural Actor-Critic) algorithm, an episode-based reinforcement-learning control algorithm, has good learning performance in theory, but it learns inefficiently because many fixed-length episodes must be sampled to obtain a good policy. To address this problem, a new algorithm named ER-eNAC is proposed, which adds an episode-reuse mechanism to eNAC. In ER-eNAC, some previously sampled episodes are reused when estimating the current natural policy gradient so that experience is exploited more effectively; each reused episode is weighted with an exponential decay in the number of policy updates it has undergone, which reflects how well it still fits the current policy. Simulation results on inverted pendulum stabilization show that, compared with eNAC, ER-eNAC significantly reduces the number of episodes that must be sampled during learning and thus improves learning efficiency.
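To make the reuse-and-decay rule described in the abstract concrete, the sketch below keeps past fixed-length episodes in a buffer, tracks how many policy updates each episode has undergone, and down-weights older episodes exponentially when estimating a policy gradient. This is a minimal illustration, not the authors' implementation: the buffer size, the decay base, and the simplified REINFORCE-style gradient estimate (standing in for eNAC's natural-gradient least-squares step) are all assumptions.

```python
import numpy as np

DECAY = 0.8          # assumed exponential-decay base for reused episodes
BUFFER_SIZE = 20     # assumed number of past episodes kept for reuse


class EpisodeBuffer:
    """Stores fixed-length episodes together with their 'age', i.e. the
    number of policy updates they have undergone since being sampled."""

    def __init__(self):
        self.episodes = []  # each entry: [score_sum, episode_return, age]

    def add(self, score_sum, episode_return):
        # A freshly sampled episode enters with age 0 (weight 1).
        self.episodes.append([score_sum, episode_return, 0])
        if len(self.episodes) > BUFFER_SIZE:
            self.episodes.pop(0)

    def age_all(self):
        # Called once per policy update: every stored episode gets older,
        # so its weight DECAY**age shrinks exponentially.
        for ep in self.episodes:
            ep[2] += 1


def weighted_gradient_estimate(buffer, n_params):
    """Weighted episode-based gradient estimate: reused episodes contribute
    with weight DECAY**age, fresh episodes (age 0) with weight 1."""
    g = np.zeros(n_params)
    total_w = 0.0
    for score_sum, episode_return, age in buffer.episodes:
        w = DECAY ** age
        g += w * score_sum * episode_return
        total_w += w
    return g / max(total_w, 1e-12)


# Usage sketch with random stand-in data for the per-episode score sums.
rng = np.random.default_rng(0)
buf = EpisodeBuffer()
for update in range(3):                        # three policy updates
    for _ in range(5):                         # five new episodes per update
        buf.add(rng.normal(size=4), rng.normal())
    grad = weighted_gradient_estimate(buf, 4)  # mixes fresh and reused episodes
    buf.age_all()                              # reused episodes decay further next time
```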
Source: Journal of South China University of Technology (Natural Science Edition), 2012, No. 6, pp. 70-75 (6 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Young Scientists Fund of the National Natural Science Foundation of China (61004066); Zhejiang Provincial Science and Technology Plan Project (2011C23106).
Keywords: reinforcement learning; natural policy gradient; experience reuse; inverted pendulum control

References (13)

  • 1 Tamei T, Shibata T. Fast reinforcement learning for three-dimensional kinetic human-robot cooperation with EMG-to-activation model [J]. Advanced Robotics, 2011, 25(5): 563-580.
  • 2 Han Y K, Kimura H. Motions obtaining of multi-degree-freedom underwater robot by using reinforcement learning algorithms [C]// Proceedings of TENCON IEEE Region 10 Conference. Fukuoka: IEEE, 2010: 1498-1502.
  • 3 Peters J, Schaal S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7/8/9): 1180-1190.
  • 4 Abbeel P. Apprenticeship learning and reinforcement learning with application to robotic control [D]. Stanford: Department of Computer Science, Stanford University, 2008: 129-151.
  • 5 Yu Tao, Hu Xibing, Liu Jing. Multi-objective optimal power flow calculation based on the multi-step backtracking Q(λ) learning algorithm [J]. Journal of South China University of Technology (Natural Science Edition), 2010, 38(10): 139-145. (Cited by: 6)
  • 6 Chu B, Park J, Hong D. Tunnel ventilation controller design using an RLS-based natural actor-critic algorithm [J]. International Journal of Precision Engineering and Manufacturing, 2010, 11(6): 829-838.
  • 7 Peters J, Vijayakumar S, Schaal S. Reinforcement learning for humanoid robotics [C]// Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots. Karlsruhe: IEEE, 2003: 2002-2021.
  • 8 Bhatnagar S, Sutton R S, Ghavamzadeh M, et al. Natural actor-critic algorithms [J]. Automatica, 2009, 45(11): 2471-2482.
  • 9 Rosenstein M T, Barto A G. Reinforcement learning with supervision by a stable controller [C]// Proceedings of the 2004 American Control Conference. Boston: IEEE, 2004: 4517-4522.
  • 10 Lin L J. Self-improving reactive agents based on reinforcement learning, planning and teaching [J]. Machine Learning, 1992, 8(3/4): 293-321.

Secondary References (43)

  • 1 SUTTON R, BARTO A. Reinforcement learning: an introduction [M]. MIT Press, 1998.
  • 2 SINGH S P. Learning to solve Markovian decision processes [D]. University of Massachusetts, 1994.
  • 3 ROY B V. Learning and value function approximation in complex decision processes [M]. MIT Press, 1998.
  • 4 WATKINS C. Learning from delayed rewards [D]. Cambridge: University of Cambridge, 1989.
  • 5 HUMPHRYS M. Action selection methods using reinforcement learning [D]. Cambridge: University of Cambridge, 1996.
  • 6 BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming [M]. Belmont, Mass.: Athena Scientific, 1996.
  • 7 SUTTON R S, MCALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation [A]// Advances in Neural Information Processing Systems [C]. Denver, USA, 2000.
  • 8 BAIRD L C. Residual algorithms: reinforcement learning with function approximation [A]// Proceedings of the 12th International Conference on Machine Learning [C]. San Francisco, 1995.
  • 9 TSITSIKLIS J N, ROY V B. Feature-based methods for large scale dynamic programming [J]. Machine Learning, 1996(22): 59-94.
  • 10 BAXTER J, BARTLETT P L. Infinite-horizon policy-gradient estimation [J]. Journal of Artificial Intelligence Research, 2001(15): 319-350.

Co-citing Literature (11)

Co-cited Literature (20)

  • 1 SUTTON R S, BARTO A G. Introduction to reinforcement learning [M]. Cambridge: MIT, 1998.
  • 2 PETERS J, SCHAAL S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7): 1180-1190.
  • 3 GRONDMAN I, BUSONIU L, LOPES G A D, et al. A survey of actor-critic reinforcement learning: standard and natural policy gradients [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(6): 1291-1307.
  • 4 BHATNAGAR S, SUTTON R S, GHAVAMZADEH M, et al. Natural actor-critic algorithms [J]. Automatica, 2009, 45(11): 2471-2482.
  • 5 SUTTON R S. Learning to predict by the methods of temporal differences [J]. Machine Learning, 1988, 3(1): 9-44.
  • 6 ADAM S, BUSONIU L, BABUSKA R. Experience replay for real-time reinforcement learning control [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(2): 201-212.
  • 7 BRADTKE S J, BARTO A G. Linear least-squares algorithms for temporal difference learning [J]. Machine Learning, 1996, 22(1/2/3): 33-57.
  • 8 BOYAN J A. Technical update: least-squares temporal difference learning [J]. Machine Learning, 2002, 49(2/3): 233-246.
  • 9 DANN C, NEUMANN G, PETERS J. Policy evaluation with temporal differences: a survey and comparison [J]. The Journal of Machine Learning Research, 2014, 15(1): 809-883.
  • 10 GEIST M, PIETQUIN O. Revisiting natural actor-critics with value function approximation [M]// Modeling Decisions for Artificial Intelligence. Berlin: Springer, 2010: 207-218.

Citing Literature (1)

Secondary Citing Literature (3)
