Abstract
To address the poor exploration and sparse rewards of conventional reinforcement learning in air combat environments, a curriculum learning distributed proximal policy optimization (CLDPPO) reinforcement learning algorithm is proposed. A reward function that embeds expert knowledge is integrated, a discretized action space is designed, and an actor-critic network that separates local observations from global observations is constructed. Offensive, defensive, and comprehensive curricula are formulated for the unmanned aerial vehicles (UAVs), so that they learn combat skills progressively, starting from basic courses, and improve their combat capability stage by stage. The experimental results show that UAVs trained with curriculum learning defeat both an expert system and mainstream reinforcement learning algorithms by a certain margin, learn air combat tactics on their own, and effectively alleviate the sparse-reward problem.
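The staged curriculum described in the abstract (offensive, then defensive, then comprehensive) can be illustrated with a minimal scheduler sketch. The abstract does not say how progression between stages is decided, so the win-rate promotion rule, the threshold, the window size, and the name `CurriculumScheduler` are all assumptions made for illustration, not the authors' method.

```python
from collections import deque
from dataclasses import dataclass, field

# Stage names follow the abstract; everything else here is hypothetical.
STAGES = ("offensive", "defensive", "comprehensive")


@dataclass
class CurriculumScheduler:
    promote_win_rate: float = 0.8  # assumed promotion threshold
    window: int = 100              # assumed number of recent episodes evaluated
    stage_index: int = 0
    results: deque = field(default_factory=deque)

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def record(self, won: bool) -> str:
        """Store one episode outcome; promote when the recent win rate is high enough."""
        self.results.append(won)
        if len(self.results) > self.window:
            self.results.popleft()
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.promote_win_rate:
            if self.stage_index < len(STAGES) - 1:
                self.stage_index += 1
                self.results.clear()  # start the next stage with fresh statistics
        return self.stage
```

In a training loop, the current `stage` would select which scenario and reward terms the environment uses, while the policy update itself (here, DPPO) stays unchanged across stages.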
Authors
祝靖宇
张宏立
匡敏驰
史恒
朱纪洪
乔直
周文卿
Zhu Jingyu; Zhang Hongli; Kuang Minchi; Shi Heng; Zhu Jihong; Qiao Zhi; Zhou Wenqing (School of Electrical Engineering, Xinjiang University, Urumqi 830000, China; Department of Precision Instrument, Tsinghua University, Beijing 100084, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《系统仿真学报》
CAS
CSCD
PKU Core Journal (北大核心)
2024, Issue 6, pp. 1452-1467 (16 pages)
Journal of System Simulation
Keywords
UAVs
air combat
sparse reward
curriculum learning
distributed proximal policy optimization (DPPO)