
Performance Comparison of Several Classical Policy Gradient Algorithms (cited by: 1)

The Comparison of Policy Gradient Algorithms in Reinforcement Learning
Abstract: Policy gradient methods are based on direct policy search. The policy is parameterized, the gradient of the optimization objective with respect to the policy parameters is estimated, and the parameters are adjusted along that gradient until a locally optimal policy is obtained. The resulting policy can be either stochastic or deterministic. Using a self-developed Gridworld policy-gradient experimental platform, the convergence performance of the classical GPOMDP, NAC, and TD(λ)-based policy gradient algorithms is compared and analyzed. The benchmark shows that the TD(λ)-based algorithm, which is aided by value functions, converges better than the others.
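The policy-gradient procedure described in the abstract (parameterize the policy, estimate the gradient of the objective with respect to the parameters, then adjust the parameters along it) can be sketched with a minimal REINFORCE-style Monte-Carlo update on a small gridworld. The environment, rewards, and hyperparameters below are illustrative assumptions, not the paper's actual experimental setup:

```python
import numpy as np

N = 4                                    # N x N grid; start (0,0), goal (N-1,N-1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, a):
    """Move in the grid with reflecting walls; +1 at the goal, small step cost."""
    r, c = state
    dr, dc = ACTIONS[a]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    done = (nr, nc) == (N - 1, N - 1)
    return (nr, nc), (1.0 if done else -0.01), done

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta, rng, max_steps=100):
    """Sample one trajectory under the current stochastic softmax policy."""
    state, traj = (0, 0), []
    for _ in range(max_steps):
        s = state[0] * N + state[1]
        a = rng.choice(4, p=softmax(theta[s]))
        state, reward, done = step(state, a)
        traj.append((s, a, reward))
        if done:
            break
    return traj

def reinforce(episodes=2000, alpha=0.1, gamma=0.99, seed=0):
    """Adjust policy parameters along the estimated gradient of the return."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((N * N, 4))         # one softmax preference row per state
    for _ in range(episodes):
        traj = run_episode(theta, rng)
        G = 0.0
        for s, a, r in reversed(traj):   # discounted return-to-go, backwards
            G = r + gamma * G
            grad = -softmax(theta[s])    # d log pi(a|s) / d theta for softmax
            grad[a] += 1.0
            theta[s] += alpha * G * grad # gradient-ascent parameter update
    return theta
```

This corresponds to the likelihood-ratio family the paper compares (GPOMDP and the Williams REINFORCE estimator); NAC additionally preconditions the gradient with the inverse Fisher information matrix, and the TD(λ)-based variant replaces the Monte-Carlo return with a value-function estimate.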
Authors: 王辉, 于婧
Source: 《电脑知识与技术(过刊)》 (Computer Knowledge and Technology), 2014, Issue 10X, pp. 6937-6941, 6944 (6 pages)
Keywords: reinforcement learning; policy gradient; convergence; simulation experiments

References (4)

  • 1. Peters J, Schaal S. Natural Actor-Critic[J]. Neurocomputing, 2008(7).
  • 2. 王学宁, 陈伟, 张锰, 徐昕, 贺汉根. A survey of direct policy search methods in reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 16-24.
  • 3. Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3-4).
  • 4. Sutton R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1).


