Abstract
To address the sparse rewards and missing information that deep reinforcement learning algorithms face in partially observable environments, a proximal policy optimization algorithm combining a curiosity module with self-imitation learning (SIL) is proposed. The algorithm uses a random network to generate experience samples during exploration, and then applies prioritized experience replay to select high-quality samples. Successful trajectories are imitated through SIL, and a new policy network is updated to guide the exploration behavior. Ablation and comparison experiments were carried out in the Minigrid environment. The results show that the proposed algorithm has a clear advantage in convergence speed and can complete more complex exploration tasks in partially observable environments.
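To make the two mechanisms named in the abstract concrete, the following is a minimal sketch, not the authors' implementation: a random-network curiosity bonus (fixed target network plus a trained predictor) and a self-imitation loss computed on samples from a prioritized replay buffer. All network sizes, dimensions, and names (OBS_DIM, N_ACTIONS, intrinsic_reward, sil_loss) are illustrative assumptions; the PPO update and the Minigrid environment are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 16, 4  # assumed toy sizes

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

# Curiosity bonus from a fixed random target network and a trained predictor.
target_net = mlp(OBS_DIM, 32)      # frozen random network
predictor_net = mlp(OBS_DIM, 32)   # trained to match the target elsewhere
for p in target_net.parameters():
    p.requires_grad_(False)

def intrinsic_reward(obs):
    # Prediction error is large on rarely visited states, so it serves as an exploration bonus.
    with torch.no_grad():
        return (predictor_net(obs) - target_net(obs)).pow(2).mean(dim=-1)

# Self-imitation loss over samples drawn from a prioritized replay buffer.
policy_net = mlp(OBS_DIM, N_ACTIONS)   # actor
value_net = mlp(OBS_DIM, 1)            # critic

def sil_loss(obs, actions, returns, weights):
    # Only transitions whose stored return exceeds the current value estimate
    # (positive advantage) contribute, so the agent imitates its own good trajectories.
    logp = F.log_softmax(policy_net(obs), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    value = value_net(obs).squeeze(1)
    adv = (returns - value).clamp(min=0)
    policy_term = -(weights * logp_a * adv.detach()).mean()
    value_term = 0.5 * (weights * adv.pow(2)).mean()
    return policy_term + value_term

# Toy usage with random data, only to show the tensor shapes involved.
obs = torch.randn(8, OBS_DIM)
actions = torch.randint(N_ACTIONS, (8,))
returns = torch.randn(8)
weights = torch.ones(8)        # importance weights from prioritized replay
bonus = intrinsic_reward(obs)  # would be added to the extrinsic reward
sil_loss(obs, actions, returns, weights).backward()
```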
Authors
吕相霖
臧兆祥
李思博
邹耀斌
LÜ Xianglin; ZANG Zhaoxiang; LI Sibo; ZOU Yaobin (Hubei Key Laboratory of Intelligent Vision Monitoring for Hydropower Engineering, China Three Gorges University, Yichang 443002, China; School of Computer and Information, China Three Gorges University, Yichang 443002, China)
Source
《现代电子技术》
Peking University Core Journal (北大核心)
2024, No. 16, pp. 137-144 (8 pages)
Modern Electronics Technique
Funding
National Natural Science Foundation of China (61502274)
Natural Science Foundation of Hubei Province (2015CFB336)
Keywords
curiosity module
self-imitation learning
deep reinforcement learning
proximal policy optimization
random network
prioritized experience replay