Abstract
In within-visual-range air combat maneuvering decision-making, onboard sensing equipment such as electro-optical sensors and radar is susceptible to enemy jamming and meteorological factors, producing situational perception errors. Although deep reinforcement learning (DRL) has made significant progress in air combat maneuvering decision-making, existing methods do not consider the influence of situational perception errors on DRL. Because the state space is continuous and high-dimensional, perception errors degrade the precision and accuracy of state estimation, which in turn affects DRL's training speed and decision-making performance. To address this problem, a Proximal Policy Optimization algorithm with situational features extracted by a gated recurrent unit (GPPO) is proposed. First, a gated recurrent unit (GRU) is introduced on top of the PPO algorithm to fuse the preceding situation information and extract the hidden features across consecutive situation sequences. Then, an advantage-situation solving unit compresses the dimensionality of the DRL state space to reduce training difficulty, and a reward shaping (RS) method that quantifies the advantage is designed to guide DRL training toward faster convergence. Finally, the relative situation model of within-visual-range air combat is defined and described; an air combat simulation environment with situational perception errors is built by designing and injecting perception-error terms, and comparative simulation experiments are conducted under multiple scenarios with different perception-error intensities and different initial friend-foe situations. The simulation results show that the GPPO algorithm can effectively complete advantageous air combat maneuvering decisions in various within-visual-range scenarios with situational perception errors, and that models trained with GPPO and the quantified-advantage RS method converge significantly faster and achieve significantly better maneuvering decision performance than baseline reinforcement learning algorithms, effectively improving the UAV's air combat maneuvering decision-making capability in the presence of situational perception errors.
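To make the described architecture concrete, the following is a minimal sketch (not the authors' implementation) of the GPPO idea summarized above: a GRU front-end fuses a short window of noisy relative-situation observations into a hidden feature that feeds PPO actor and critic heads, and a toy Gaussian term stands in for the injected situational perception error. All class names, layer sizes, the observation dimension, the action count, and the noise model are illustrative assumptions.

```python
# Illustrative sketch of a GRU-based feature extractor feeding PPO actor/critic heads.
# Layer sizes, observation/action dimensions, and the noise model are assumptions.
import torch
import torch.nn as nn


class GRUActorCritic(nn.Module):
    def __init__(self, obs_dim: int = 12, hidden_dim: int = 128, n_actions: int = 7):
        super().__init__()
        # GRU fuses the preceding situation sequence into a hidden feature vector.
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.actor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, n_actions)
        )
        self.critic = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, obs_seq: torch.Tensor, h0: torch.Tensor | None = None):
        # obs_seq: (batch, seq_len, obs_dim) -- a window of recent (noisy) observations.
        _, h_n = self.gru(obs_seq, h0)   # h_n: (1, batch, hidden_dim)
        feat = h_n.squeeze(0)            # last hidden state as the situation feature
        logits = self.actor(feat)        # discrete maneuver logits for the PPO policy
        value = self.critic(feat)        # state-value estimate for the PPO critic
        return torch.distributions.Categorical(logits=logits), value


def add_perception_error(obs: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Toy stand-in for situational perception error: zero-mean Gaussian noise on the
    observed relative state; sigma plays the role of the error intensity."""
    return obs + sigma * torch.randn_like(obs)
```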
Authors
TIAN Chengbin, LI Hui, CHEN Xiliang, WU Fengguo
(College of Computer Science, Sichuan University, Chengdu 610065, China; National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China; College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Source
Advanced Engineering Sciences (《工程科学与技术》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals
2024, No. 6, pp. 270-282 (13 pages)
Funding
Key Program of the National Natural Science Foundation of China (U20A20161)
National Natural Science Foundation of China (62273356)
Keywords
deep reinforcement learning
within-visual-range air combat
maneuvering decision
perception error
reward shaping
UAV