Abstract
【Purposes】A phasic policy gradient algorithm with sample reuse (SR-PPG) is proposed to address the problems of samples not being reusable and of low sample utilization in policy-based deep reinforcement learning algorithms.【Methods】The algorithm introduces offline data on the basis of the phasic policy gradient (PPG) algorithm, thereby reducing the time cost of training and enabling the model to converge quickly. In this work, SR-PPG combines the stability advantages of theoretically supported on-policy algorithms with the sample efficiency of off-policy algorithms, develops policy improvement guarantees applicable to the off-policy setting, and links these bounds to the clipping mechanism used by PPG.【Findings】A series of theoretical and experimental results show that the algorithm delivers better performance by effectively balancing the competing goals of stability and sample efficiency.
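To make the reuse-plus-clipping idea concrete, the following is a minimal Python/PyTorch sketch of a PPO/PPG-style clipped surrogate loss evaluated on samples collected by several recent policies. It is an illustrative reading of the abstract, not the authors' implementation: the `policy.log_prob` interface, the per-policy batch layout, and the choice of a single fixed clip range for reused samples are assumptions made for the example.

```python
import torch

def clipped_policy_loss(log_prob_new, log_prob_behavior, advantages, clip_eps=0.2):
    """PPO/PPG-style clipped surrogate loss on one batch of samples.

    log_prob_new:      log pi_theta(a|s) under the policy being optimized
    log_prob_behavior: log pi_k(a|s) under the (possibly older) policy that
                       collected the sample; with sample reuse this is not
                       necessarily the most recent policy
    advantages:        advantage estimates for these samples
    clip_eps:          clipping range; how SR-PPG adapts it to reused data is
                       not specified here, so a fixed value is assumed
    """
    ratio = torch.exp(log_prob_new - log_prob_behavior)   # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # negate to minimize

def policy_phase_loss(batches, policy, clip_eps=0.2):
    """Aggregate the clipped loss over data kept from the last K policies.

    `batches` is a list of dicts, one per recent behavior policy, each holding
    states, actions, behavior log-probs, and advantages (hypothetical layout).
    """
    total, n = 0.0, 0
    for batch in batches:
        log_prob_new = policy.log_prob(batch["states"], batch["actions"])
        size = len(batch["advantages"])
        total = total + size * clipped_policy_loss(
            log_prob_new, batch["log_prob_behavior"], batch["advantages"], clip_eps
        )
        n += size
    return total / n
```

The sketch only covers the policy phase of a PPG-style update; the auxiliary (value-distillation) phase and the specific policy improvement bounds developed in the paper are not represented.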
Authors
LI Hailiang, WANG Li (College of Data Science, Taiyuan University of Technology, Jinzhong 030600, China)
Source
Journal of Taiyuan University of Technology (《太原理工大学学报》)
Indexed in CAS; Peking University Core Journals (北大核心)
2024, No. 4, pp. 712-719 (8 pages)
Funding
Joint Funds for Regional Innovation and Development of the National Natural Science Foundation of China (U22A20167)
National Key Research and Development Program of China (2021YFB3300503).
Keywords
deep reinforcement learning
phasic policy gradient
sample reuse