
Proximal Policy Optimization with Double Clipping Boundaries
Abstract: Proximal policy optimization (PPO) is a stable deep reinforcement learning algorithm. One of its key components is a clipped surrogate objective that limits the update step size. Experiments show that when the empirically optimal clipping coefficient is used, no upper bound can be established for the Kullback-Leibler (KL) divergence, which contradicts trust-region optimization theory. This paper proposes an improved algorithm, proximal policy optimization with double clipping boundaries (PPO-DC). The algorithm adjusts the KL divergence through two probability-based clipping boundaries and constrains the policy parameters within the trust region, so that the sample data are fully utilized. On several continuous control tasks, the PPO-DC algorithm achieves better performance than other algorithms.
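For reference, the clipped surrogate objective that the abstract refers to is the standard PPO objective L^CLIP(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1−ε, 1+ε)A_t)]. The sketch below illustrates only this standard single-clip mechanism, not the PPO-DC double clipping boundaries defined in the paper; the coefficient ε = 0.2 is the commonly used empirical value, assumed here for illustration.

```python
# Minimal sketch of the standard PPO clipped surrogate objective that PPO-DC
# builds on. The double clipping boundaries of PPO-DC are NOT reproduced here,
# and epsilon = 0.2 is the commonly used empirical clipping coefficient,
# not a value taken from this paper.
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective L^CLIP for one sample.

    ratio:     pi_theta(a|s) / pi_theta_old(a|s), the probability ratio
    advantage: estimated advantage A(s, a)
    epsilon:   clipping coefficient bounding the policy update
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum gives a pessimistic (lower) bound on the surrogate,
    # removing any incentive to push the ratio outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped)

# Example: with a positive advantage and a ratio already above 1 + epsilon,
# the objective is capped, so the gradient no longer pushes the ratio further.
print(ppo_clip_objective(ratio=1.3, advantage=2.0))  # 2.4 (clipped at 1.2 * 2.0)
```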
Authors: ZHANG Jun; WANG Hong-Cheng (School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China; School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China)
Source: Computer Systems & Applications (《计算机系统应用》), 2023, No. 4, pp. 177-186 (10 pages)
Funding: Key Scientific Research Platforms and Projects of Guangdong Regular Institutions of Higher Education (2020ZDZX3075).
Keywords: reinforcement learning; policy gradient (PG); proximal policy optimization (PPO); clipping mechanism