
Double Time-Series Q-Network Algorithm with Multi-Step Accumulated Reward
Abstract: Vehicle driving control decision-making is a core technology of autonomous driving. Existing deep-reinforcement-learning-based control decision algorithms for autonomous vehicles suffer from low data-processing efficiency and an inability to effectively extract temporal features between states. This paper therefore proposes a double time-series Q-network algorithm based on multi-step accumulated rewards. First, a multi-step accumulated reward method is designed: it averages the cumulative sum of future multi-step instant rewards and combines the result with the current instant reward to drive the agent's control policy, with the reward function weighted so that the current instant reward remains dominant. Second, a time-series network structure combining a long short-term memory (LSTM) network with a convolutional neural network (CNN) is designed to strengthen the agent's ability to capture temporal features between data frames. Experimental results verify that the time-series network and the multi-step accumulated reward method accelerate agent convergence: after adding the time-series network, the convergence speeds of DQN and DDQN improve by 21.9% and 26.8%, respectively. In the typical Town01 and Town02 scenarios of the Carla simulation platform, the control scores of the proposed algorithm exceed those of DDQN and TD3 by 36.1% and 24.6%, respectively, and in the complex Town03 scenario the algorithm shows better generalization across different routes. These results show that the proposed algorithm effectively improves data-utilization efficiency and has good control and generalization ability.
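The abstract's reward design can be sketched as follows. This is a minimal illustration only: the paper's exact weighting coefficients, horizon length, and the function names used here (`shaped_reward`, `alpha`, `n`) are assumptions, not the authors' published formulation. The sketch averages the next `n` instant rewards and blends that mean with the current instant reward, keeping the current reward dominant.

```python
def shaped_reward(rewards, t, n=5, alpha=0.7):
    """Blend the current instant reward with the mean of the next n rewards.

    Hypothetical parameters: alpha > 0.5 keeps the current reward dominant,
    as the abstract describes; the paper's actual values are not given here.
    """
    future = rewards[t + 1 : t + 1 + n]          # up to n future instant rewards
    future_mean = sum(future) / len(future) if future else 0.0
    return alpha * rewards[t] + (1 - alpha) * future_mean
```

Averaging (rather than summing) the future rewards keeps the shaped value on the same scale as a single instant reward regardless of the horizon `n`.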
Authors: ZHU Wei, QIAO Xian-feng, CHEN Yi-kai, HE De-feng (School of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China)
Source: Control Theory & Applications (《控制理论与应用》), indexed in EI, CAS, CSCD, and the Peking University Core list; 2022, No. 2, pp. 222-230 (9 pages)
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY21F010009), the National Natural Science Foundation of China (61773345), and the Open Fund of the State Key Laboratory of Automotive Simulation and Control (20171103).
Keywords: deep reinforcement learning; unmanned vehicles; multi-step accumulated reward; time-series network; data utilization
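The CNN-plus-LSTM time-series structure described in the abstract can be sketched as below. This is a hypothetical PyTorch sketch, not the paper's architecture: the layer counts, channel sizes, sequence length, and 84×84 input resolution are all assumptions chosen only to show how per-frame CNN features can be fed through an LSTM before the Q-value head.

```python
import torch
import torch.nn as nn

class TimeSeriesQNet(nn.Module):
    """Sketch of a CNN+LSTM Q-network over a sequence of image frames.

    All layer sizes are illustrative; the paper's actual configuration
    is not reproduced here.
    """
    def __init__(self, n_actions, in_channels=3):
        super().__init__()
        # Per-frame convolutional feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # LSTM captures temporal features across the frame sequence
        # (32 * 9 * 9 is the flattened CNN output for 84x84 inputs)
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=256,
                            batch_first=True)
        self.head = nn.Linear(256, n_actions)

    def forward(self, x):
        # x: (batch, seq_len, channels, height, width)
        b, s = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, s, -1)  # per-frame features
        out, _ = self.lstm(feats)                         # temporal encoding
        return self.head(out[:, -1])                      # Q-values at last step
```

Running the CNN on every frame independently and then letting the LSTM aggregate the sequence is the standard way to combine the two, and matches the abstract's stated goal of capturing temporal features between data frames.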