
Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation (cited 3 times)

Abstract: Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing works on Dyna mostly discuss only its efficiency in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, aiming to handle problems with continuous action spaces. Dyna-LSTD-PA stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process determines the probability distribution over action spaces using the Gaussian distribution; estimates the underlying value function, policy, and model by linear representation; and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by using offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman-Morrison formula to improve the efficiency of LSTD(λ), and weights the parameter vector of the value function to bring the two processes together. Theoretically, a global error bound is derived by considering approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in terms of convergence rate, success rate, and stability on four benchmark RL problems.
Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2019, Issue 1, pp. 106-126 (21 pages).


