Abstract
Offline reinforcement learning algorithms approximate the learned policy to the behavior policy by reducing distribution shift, but the data distribution of the offline experience buffer often directly affects the quality of the learned policy. To improve the training of reinforcement learning agents by optimizing the sampling model, two offline prioritized sampling models are proposed: a temporal difference error-based model and a martingale-based model. The temporal difference error-based sampling model lets the agent learn more from experience data whose values are estimated inaccurately, so that more accurate value function estimates can cope with possible out-of-distribution states. The martingale-based sampling model lets the agent learn more from positive samples that benefit policy optimization and reduces the impact of negative samples on value function iteration. Furthermore, the proposed offline prioritized sampling models are each combined with batch-constrained deep Q-learning (BCQ), yielding temporal difference error-based prioritized BCQ and martingale-based prioritized BCQ. Experimental results on the D4RL and Torcs datasets show that the proposed offline prioritized sampling models can selectively pick the experience data that are conducive to value function estimation or policy optimization, thereby obtaining higher returns.
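As a rough illustration of the first idea, the sketch below shows prioritized sampling from a fixed offline buffer with priorities derived from TD errors. It is a minimal sketch, not the paper's exact formulation: the class name, the exponent alpha, and the epsilon floor are illustrative assumptions; the abstract only states that transitions with larger value-estimation error are sampled more often.

```python
import numpy as np

class OfflinePrioritizedBuffer:
    """Illustrative TD-error-based prioritized sampling over a fixed offline dataset."""

    def __init__(self, transitions, alpha=0.6, epsilon=1e-3):
        # transitions: list of (state, action, reward, next_state, done) tuples
        self.transitions = transitions
        self.alpha = alpha        # how strongly TD error shapes the priority
        self.epsilon = epsilon    # keeps every transition reachable
        self.priorities = np.ones(len(transitions), dtype=np.float64)

    def sample(self, batch_size):
        # Sampling probability proportional to priority^alpha
        p = self.priorities ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=p)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Larger |TD error| -> higher priority on the next draw
        self.priorities[idx] = np.abs(td_errors) + self.epsilon
```

In use, td_errors would come from the current value estimate, e.g. |r + γ max_a Q(s', a) − Q(s, a)| computed on the sampled batch, so transitions whose values are poorly estimated are revisited more often.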
Authors
GU Yang; CHENG Yu-Hu; WANG Xue-Song (School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116)
Source
Acta Automatica Sinica (《自动化学报》), 2024, No. 1, pp. 143-153 (11 pages)
Indexed in: EI, CAS, CSCD, Peking University Core Journals
Funding
Supported by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095).
Keywords
Offline reinforcement learning
Prioritized sampling model
Temporal difference error
Martingale
Batch-constrained deep Q-learning (BCQ)