Abstract
Policy gradient methods are based on direct policy search. They parameterize the policy, estimate the gradient of the performance objective with respect to the policy parameters, and use that gradient to adjust the parameters until a locally optimal policy is reached. The resulting policy can be either stochastic or deterministic. Using a self-developed Gridworld policy gradient experiment platform, the convergence performance of the classical GPOMDP, NAC, and TD(λ)-based policy gradient algorithms is compared. The convergence benchmark shows that the TD(λ)-based algorithm, which makes use of value functions, outperforms the other two.
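The loop described in the abstract (parameterize the policy, estimate the gradient, adjust the parameters) can be illustrated with a minimal sketch. It is not the authors' Gridworld platform and not one of the three algorithms compared: it uses a plain likelihood-ratio (REINFORCE-style) estimator, a simpler relative of GPOMDP. The five-cell corridor, the reward of 1 at the goal, the discount factor, the step size, and every name in the code are assumptions made for illustration.

```python
import math
import random

N_STATES = 5          # hypothetical corridor: cells 0..4, cell 4 is the goal
LEFT, RIGHT = 0, 1
GAMMA = 0.9           # assumed discount factor


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move one cell (cell 0 is a wall); reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def run_episode(theta, rng, max_steps=100):
    """Sample a list of (state, action, reward) under the current policy."""
    state, traj = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = RIGHT if rng.random() < probs[RIGHT] else LEFT
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    return traj


def train(episodes=3000, alpha=0.2, seed=0):
    """Stochastic gradient ascent on the tabular softmax parameters theta."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        traj = run_episode(theta, rng)
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        ret = 0.0
        # Walk backwards so `ret` is the discounted return from each step on.
        for state, action, reward in reversed(traj):
            ret = reward + GAMMA * ret
            probs = softmax(theta[state])
            for a in (LEFT, RIGHT):
                # d/d theta[state][a] of log pi(action | state) for a softmax.
                grad_log = (1.0 if a == action else 0.0) - probs[a]
                grad[state][a] += ret * grad_log
        # Apply the whole-episode gradient estimate in one step.
        for s in range(N_STATES):
            for a in (LEFT, RIGHT):
                theta[s][a] += alpha * grad[s][a]
    return theta


theta = train()
for s in range(N_STATES - 1):
    print(f"state {s}: P(right) = {softmax(theta[s])[RIGHT]:.3f}")
```

The learned policy here is stochastic, since it is a softmax over preferences. The estimator uses no baseline and no value function, so it is only a starting point for seeing why value-function-assisted variants such as the TD(λ)-based algorithm might converge differently.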
Source
Computer Knowledge and Technology (电脑知识与技术, back issues)
2014, Issue 10X, pp. 6937-6941, 6944 (6 pages)
Keywords
reinforcement learning
policy gradient
convergence
simulation experiments