Noisy twin delayed deep deterministic policy gradient algorithm using double experience replay buffers
Abstract  To improve the exploration ability and convergence speed of the twin delayed deep deterministic policy gradient (TD3) algorithm, this paper proposes a noisy TD3 algorithm that uses double experience replay buffers based on a multi-step prioritized and resampling-preferred mechanism. First, noise flow is added to each layer of the policy network to increase the randomness of its parameters. Then, a multi-step prioritized experience replay buffer is built, in which multiple consecutive samples are stored as one basic unit; during training, the value function is effectively approximated by multi-step clipped double Q-processing. In addition, a second experience replay buffer using a resampling-preferred mechanism is set up to store samples of greater learning value, and the double-buffer design compensates for the lack of sample diversity. Simulation experiments in the Walker2d-v2 scenario of the OpenAI Gym platform show that the proposed algorithm achieves a clearly higher average reward than the comparison algorithms and that its network converges much faster.
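The two core mechanisms summarized in the abstract can be illustrated with short sketches. The paper itself provides no code, so everything below is an assumption-laden illustration in PyTorch, with all class, function, and argument names invented for the example. First, a minimal sketch of a NoisyNet-style linear layer with factorized Gaussian parameter noise, the kind of noise flow the abstract says is added to each layer of the policy network:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with factorized Gaussian parameter noise (NoisyNet-style).
    Hypothetical sketch; not the authors' implementation."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales for weights and biases.
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        nn.init.constant_(self.sigma_w, sigma0 * bound)
        nn.init.constant_(self.sigma_b, sigma0 * bound)

    @staticmethod
    def _scale(x):
        # f(x) = sign(x) * sqrt(|x|), the factorized-noise transform.
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh factorized noise on every forward pass keeps exploration
        # driven by the parameters rather than by action-space noise alone.
        eps_in = self._scale(torch.randn(self.in_features, device=x.device))
        eps_out = self._scale(torch.randn(self.out_features, device=x.device))
        weight = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
        bias = self.mu_b + self.sigma_b * eps_out
        return F.linear(x, weight, bias)
```

Second, the multi-step clipped double Q-processing amounts to bootstrapping an n-step discounted return with the pessimistic minimum of the two target critics, i.e. y = r^(n) + γ^n · min(Q1′, Q2′)(s_{t+n}, a′), where r^(n) is the discounted sum of the n stored rewards and a′ is the smoothed target-policy action from standard TD3. A hedged sketch, assuming the n-step return was precomputed when the multi-step unit was stored in the buffer:

```python
import torch

@torch.no_grad()
def nstep_clipped_double_q_target(n_step_return, next_state, done, n,
                                  actor_target, critic1_target, critic2_target,
                                  gamma=0.99, policy_noise=0.2, noise_clip=0.5,
                                  max_action=1.0):
    """Target value for one n-step unit; all names here are illustrative."""
    # Target-policy smoothing, as in standard TD3.
    next_action = actor_target(next_state)
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip,
                                                                 noise_clip)
    next_action = (next_action + noise).clamp(-max_action, max_action)
    # Clipped double Q: take the minimum of the two target critics.
    q_next = torch.min(critic1_target(next_state, next_action),
                       critic2_target(next_state, next_action))
    # Bootstrap gamma**n steps past the stored n-step discounted return;
    # `done` masks units whose n-step window reached the end of an episode.
    return n_step_return + (1.0 - done) * (gamma ** n) * q_next
```

A sampled mini-batch would then mix units drawn from the multi-step prioritized buffer with units from the resampling-preferred buffer (for example, transitions re-admitted because of large temporal-difference error), which is one plausible reading of how the double buffers offset the loss of sample diversity.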
Authors  Wang Yaoru (王垚儒); Li Jun (李俊) (College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan 430065, China)
Source  Journal of Wuhan University of Science and Technology (《武汉科技大学学报》), 2020, No. 2, pp. 147-154 (8 pages); indexed by CAS and the Peking University Core Journals list.
Funding  Supported by the National Natural Science Foundation of China (61572381) and the Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology (znxx2018QN06).
Keywords  deep deterministic policy gradient; TD3 algorithm; deep reinforcement learning; noise flow; multi-step clipped double Q-learning; double experience replay buffers