Abstract
To address the problems that DDPG (deep deterministic policy gradient) tends to fall into local minima during online training and produces a large number of trial-and-error actions and invalid data in its early stage, an improved DDPG algorithm based on offline model pre-training was proposed. Existing data were used to train an object state model and a value-reward model offline, and the actor network and critic network of DDPG were pre-trained in advance, reducing the workload of the early stage of DDPG and improving the quality of online learning. A DDQN (double deep Q-network) structure was introduced to address the overestimation of Q values. In the simulation results, the average cumulative reward obtained increased by 9.15%, showing that the improved algorithm effectively improves the performance of the DDPG algorithm.
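As a rough illustration of the two ideas summarized above (offline pre-training of the actor and critic against learned state and reward models, and a DDQN-style target to curb Q-value overestimation), the following is a minimal PyTorch sketch. The names state_model, reward_model, actor, critic, target_critic, their signatures, and the exact loss terms are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch, NOT the paper's exact formulation: one offline pre-training step
# for a DDPG actor-critic pair, driven by learned state/reward models instead of
# live interaction, with a DDQN-style target to reduce Q-value overestimation.
# All module names and signatures below are assumptions for illustration.
import torch
import torch.nn.functional as F

def offline_pretrain_step(states, state_model, reward_model,
                          actor, critic, target_critic,
                          actor_opt, critic_opt, gamma=0.99):
    """One pre-training step on a batch of logged states."""
    with torch.no_grad():
        actions = actor(states)                     # actions for the synthetic rollout
        next_states = state_model(states, actions)  # learned object state (transition) model
        rewards = reward_model(states, actions)     # learned value-reward model
        # DDQN-style decoupling: the online actor selects the next action,
        # the target critic evaluates it.
        next_actions = actor(next_states)
        target_q = rewards + gamma * target_critic(next_states, next_actions)

    # Critic update: regress Q(s, a) toward the decoupled target.
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the critic's estimate of the actor's own actions.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```

Target-network soft updates and exploration noise, which a full DDPG loop would also need, are omitted from this sketch.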
Authors
ZHANG Qian; WANG Hong-ge; NI Liang (School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China)
Source
Computer Engineering and Design (《计算机工程与设计》), Peking University Core Journal (北大核心)
2022, No. 5, pp. 1451-1458 (8 pages)
Funding
Henan Province Science and Technology Research Project (222102210281, 182102210130)
China Scholarship Council Program (201908410281)
Key Scientific Research Project of Higher Education Institutions of Henan Province (21A520053)