
Double Time-Series Q-Network Algorithm with Multi-Step Accumulated Reward
Abstract: Vehicle driving control decision-making is a core technology of autonomous driving. Existing deep-reinforcement-learning-based control decision algorithms for autonomous vehicles suffer from low data-processing efficiency and an inability to effectively extract temporal features between states. This paper therefore proposes a double time-series Q-network algorithm based on multi-step accumulated rewards. First, a multi-step accumulated reward method is designed: it averages the cumulative sum of future multi-step instant rewards and combines the result with the current instant reward to drive the agent's control policy, with the reward function weighted so that the current instant reward remains dominant. Second, a time-series network structure combining a long short-term memory (LSTM) network with a convolutional neural network (CNN) is designed to strengthen the agent's ability to capture temporal features between data frames. Experimental results verify that the time-series network and the multi-step accumulated reward method accelerate agent convergence: after adding the time-series network, the convergence speeds of DQN and DDQN improve by 21.9% and 26.8%, respectively. In the typical Town01 and Town02 scenarios of the Carla simulation platform, the control scores of the proposed algorithm exceed those of DDQN and TD3 by 36.1% and 24.6%, respectively, and in the complex Town03 scenario the algorithm shows better generalization across different routes. These results show that the proposed algorithm effectively improves data-utilization efficiency and has good control and generalization ability.
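The abstract's reward design can be sketched as follows. This is a minimal illustration only: the paper's exact weighting coefficients, horizon length, and the function names used here (`shaped_reward`, `alpha`, `n`) are assumptions, not the authors' published formulation. The sketch averages the next `n` instant rewards and blends that mean with the current instant reward, keeping the current reward dominant.

```python
def shaped_reward(rewards, t, n=5, alpha=0.7):
    """Blend the current instant reward with the mean of the next n rewards.

    Hypothetical parameters: alpha > 0.5 keeps the current reward dominant,
    as the abstract describes; the paper's actual values are not given here.
    """
    future = rewards[t + 1 : t + 1 + n]          # up to n future instant rewards
    future_mean = sum(future) / len(future) if future else 0.0
    return alpha * rewards[t] + (1 - alpha) * future_mean
```

Averaging (rather than summing) the future rewards keeps the shaped value on the same scale as a single instant reward regardless of the horizon `n`.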
Authors: ZHU Wei, QIAO Xian-feng, CHEN Yi-kai, HE De-feng (School of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China)
Source: Control Theory & Applications (《控制理论与应用》), indexed in EI, CAS, CSCD, and the Peking University Core list; 2022, No. 2, pp. 222-230 (9 pages)
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY21F010009), the National Natural Science Foundation of China (61773345), and the Open Fund of the State Key Laboratory of Automotive Simulation and Control (20171103).
Keywords: deep reinforcement learning; unmanned vehicles; multi-step accumulated reward; time-series network; data utilization
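The CNN-plus-LSTM time-series structure described in the abstract can be sketched as below. This is a hypothetical PyTorch sketch, not the paper's architecture: the layer counts, channel sizes, sequence length, and 84×84 input resolution are all assumptions chosen only to show how per-frame CNN features can be fed through an LSTM before the Q-value head.

```python
import torch
import torch.nn as nn

class TimeSeriesQNet(nn.Module):
    """Sketch of a CNN+LSTM Q-network over a sequence of image frames.

    All layer sizes are illustrative; the paper's actual configuration
    is not reproduced here.
    """
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        # Per-frame convolutional feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # LSTM captures temporal features across the frame sequence
        # (32 * 9 * 9 is the flattened CNN output for 84x84 inputs)
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=256,
                            batch_first=True)
        self.head = nn.Linear(256, n_actions)

    def forward(self, x):
        # x: (batch, seq_len, channels, height, width)
        b, s = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, s, -1)  # per-frame features
        out, _ = self.lstm(feats)                         # temporal encoding
        return self.head(out[:, -1])                      # Q-values at last step
```

Running the CNN on every frame independently and then letting the LSTM aggregate the sequence is the standard way to combine the two, and matches the abstract's stated goal of capturing temporal features between data frames.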