
Twin delayed deep deterministic policy gradient based on optimistic exploration
Abstract: Twin delayed deep deterministic policy gradient (TD3) is a mainstream model-free deep reinforcement learning algorithm that has been successfully applied to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. To address the inefficient exploration caused by taking the lower bound of the twin Q-value functions as the target, a twin delayed deep deterministic policy gradient based on optimistic exploration (TD3-OE) is proposed. First, starting from the twin Q-value functions, it is shown that taking the lower bound makes exploration somewhat pessimistic. Then, a Gaussian function and a piecewise function are used to fit the twin Q-value functions. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. The exploration policy prevents the agent from converging to sub-optimal policies, effectively alleviating the problem of inefficient exploration. The proposed algorithm is compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify its effectiveness. Experimental results show that the proposed algorithm matches or exceeds the baseline reinforcement learning algorithms in reward, stability, and learning speed.
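The contrast the abstract draws between TD3's pessimistic lower-bound target and an optimistic exploration estimate can be sketched as follows. This is a minimal illustration, not the paper's method: TD3-OE fits the twin Q-value functions with a Gaussian function and a piecewise function, whereas here the optimism bonus is simply proportional to critic disagreement, and `beta` is a hypothetical exploration coefficient.

```python
def pessimistic_target(q1, q2):
    """TD3's clipped double-Q target: the element-wise lower bound
    of the two critics' value estimates (pessimistic)."""
    return [min(a, b) for a, b in zip(q1, q2)]

def optimistic_estimate(q1, q2, beta=1.0):
    """Optimistic value estimate: the critic mean plus a bonus
    proportional to critic disagreement. With beta=1 this equals the
    element-wise upper bound max(q1, q2); beta is a hypothetical
    exploration coefficient, not a parameter from the paper."""
    return [0.5 * (a + b) + beta * 0.5 * abs(a - b)
            for a, b in zip(q1, q2)]
```

Acting greedily with respect to the pessimistic estimate discourages visiting states where the critics disagree, which is exactly where more data is needed; an optimistic estimate instead rewards such disagreement, steering the exploration policy toward under-explored regions.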
Authors: Wang Haoyu; Zhang Hengbo; Cheng Yuhu; Wang Xuesong (School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Source: Journal of Nanjing University of Science and Technology, 2024, No. 3, pp. 300-309 (10 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61976215, 62176259); Natural Science Foundation of Jiangsu Province (BK20221116); Jiangsu Province Excellent Postdoctoral Program (2022ZB530).
Keywords: deep reinforcement learning; twin delayed deep deterministic policy gradient; exploration policy; optimistic exploration