Abstract
This paper proposes a new on-policy reinforcement learning algorithm. Its basic idea is to follow a given learning policy and use the information from k (k > 1) steps to estimate the TD(λ) return, which speeds up the update of the estimated optimal action values. The algorithm updates faster than SARSA(0) while requiring less computation than SARSA(λ).
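The abstract does not state the update rule itself, so the following is only a minimal sketch of one common way to estimate a TD(λ) return from k steps of experience: the truncated λ-return, computed by a backward recursion over the k observed rewards and the action values of the state-action pairs that followed. The function names, argument layout, and the step size `alpha` are assumptions made for illustration and are not taken from the paper.

```python
def truncated_lambda_return(rewards, next_q_values, gamma, lam):
    """Estimate the TD(lambda) return from k steps of on-policy experience.

    Hypothetical layout (not from the paper):
      rewards[i]       = r_{t+i+1}
      next_q_values[i] = Q(s_{t+i+1}, a_{t+i+1})
    for i = 0 .. k-1, where the actions were chosen by the learning policy.
    """
    k = len(rewards)
    if k == 0 or len(next_q_values) != k:
        raise ValueError("need k >= 1 rewards and k matching action values")

    # The last of the k steps bootstraps fully from the current estimate.
    g = rewards[-1] + gamma * next_q_values[-1]

    # Earlier steps mix the one-step bootstrap (weight 1 - lam)
    # with the longer return computed so far (weight lam).
    for i in range(k - 2, -1, -1):
        g = rewards[i] + gamma * ((1.0 - lam) * next_q_values[i] + lam * g)
    return g


def sarsa_k_step_update(q_sa, target, alpha):
    """Move the estimate Q(s_t, a_t) toward the k-step target."""
    return q_sa + alpha * (target - q_sa)
```

With `lam = 0` the target reduces to the one-step SARSA(0) target, and with `lam = 1` it becomes the plain k-step return. Because only k steps are stored, the cost per update is bounded by k, which matches the trade-off the abstract describes in general terms.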
Source
《广西工学院学报》
CAS
2002, No. 1, pp. 1-4 (4 pages)
Journal of Guangxi University of Technology