
Actor-critic algorithms based on symmetric perturbation sampling (基于对称扰动采样的Actor-critic算法)

Cited by: 1
Abstract: When solving sequential decision-making problems in continuous spaces, traditional actor-critic (AC) methods often converge slowly and to solutions of limited quality. To overcome this weakness, an AC algorithm framework based on symmetric perturbation sampling is proposed. The framework uses a Gaussian distribution as the policy distribution; at each time step it generates two actions by symmetrically perturbing the current action mean and executes both in parallel against the environment. The agent's behavior action is then selected, and the value-function parameters are updated, according to the maximum temporal-difference (TD) error of the two actions, while the policy parameters are updated using their average regular gradient or average incremental natural gradient. Theoretical analysis and simulation results show that the framework achieves both good convergence performance and high computational efficiency.
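
The per-step procedure described in the abstract can be sketched in code. The following is a minimal, hypothetical illustration rather than the authors' implementation: the cloneable environment interface, the feature function phi, the linear function approximators, and the step sizes alpha and beta are assumptions introduced for this example, and only the average regular-gradient actor update is shown (the incremental natural-gradient variant is omitted).

    import numpy as np

    def sps_ac_step(env, state, theta, w, phi, sigma=0.3, alpha=0.05, beta=0.01, gamma=0.95):
        """One step of symmetric-perturbation-sampling actor-critic (illustrative sketch).

        Assumptions of this sketch (not from the paper): a linear Gaussian policy with
        mean theta . phi(state) and fixed std sigma, a linear value function w . phi(state),
        and an environment exposing clone() and step(action) -> (next_state, reward).
        """
        feats = phi(state)
        mu = float(np.dot(theta, feats))               # current action mean
        delta = sigma * np.random.randn()              # Gaussian perturbation
        actions = (mu + delta, mu - delta)             # two symmetrically perturbed actions

        # Execute both actions "in parallel" (here: on cloned environments).
        outcomes = []
        for a in actions:
            env_copy = env.clone()                     # hypothetical interface
            next_state, reward = env_copy.step(a)
            td_error = reward + gamma * np.dot(w, phi(next_state)) - np.dot(w, feats)
            outcomes.append((a, next_state, reward, td_error))

        # Select the behavior action via the maximum TD error and update the critic with it.
        a_star, next_state, reward, td_max = max(outcomes, key=lambda o: o[3])
        w = w + alpha * td_max * feats

        # Actor update: average regular gradient of the Gaussian log-policy over both actions.
        grad = np.zeros_like(theta)
        for a, _, _, td in outcomes:
            grad += td * ((a - mu) / sigma ** 2) * feats
        theta = theta + beta * grad / 2.0

        return theta, w, next_state

In an actual run the two interactions must start from the same environment state, which is why this sketch clones the environment; how the algorithm realizes the parallel interaction, and whether the TD-error comparison is signed or in absolute value, are details specified in the paper itself.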
Source: Control and Decision (控制与决策), EI, CSCD, Peking University core journal, 2015, Issue 12, pp. 2161-2167 (7 pages).
Funding: National Natural Science Foundation of China (61100118, 60671033); Natural Science Foundation of Hainan Province (613153).
Keywords: Actor-critic method; symmetric perturbation sampling; continuous space; reinforcement learning