Abstract
Deep reinforcement learning relies heavily on the reward function when solving sequential decision tasks, yet reward functions often suffer from sparse and delayed feedback. This paper proposes a method for generating and optimizing action sequences based on deep inverse reinforcement learning: the reward function is reconstructed from expert demonstration trajectories, enabling the acquisition and use of the implicit expert experience embedded in high-quality demonstration data and mining the regularities behind those demonstrations. The reconstructed reward function is then combined with the environment's native reward function through reward shaping; the resulting reward function gives more timely and accurate feedback on an agent's behavior and greatly accelerates the convergence of reinforcement learning.
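The reward-shaping step described above can be illustrated with a minimal sketch. Here the classic potential-based shaping form (F = γ·φ(s′) − φ(s)) is assumed as the way the IRL-reconstructed reward is merged with the environment's native reward; the function names and the choice of the reconstructed reward as the potential are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of reward shaping: blend the environment's native reward
# with a reward signal reconstructed from expert demonstrations (e.g. by
# deep inverse RL). All names below are illustrative, not the paper's API.

GAMMA = 0.99  # discount factor (assumed)

def shaped_reward(env_reward, state, next_state, irl_reward_model):
    """Environment reward plus potential-based shaping term
    F = GAMMA * phi(s') - phi(s), where the potential phi is taken to be
    the IRL-reconstructed reward evaluated at a state."""
    shaping = GAMMA * irl_reward_model(next_state) - irl_reward_model(state)
    return env_reward + shaping
```

With a constant potential the shaping term nearly vanishes, so the environment reward dominates; a potential that rises along expert-like trajectories supplies the denser intermediate feedback the abstract describes.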
Authors
CHEN Xiliang, CAO Lei, SHEN Chi
(College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China; The 28th Research Institute of China Electronic Science and Technology Group Corporation, Nanjing 210007, China)
Source
National Defense Technology (《国防科技》), 2019, No. 4, pp. 55-61 (7 pages)
Keywords
deep reinforcement learning
course of action planning
smart warfare