
Research on course of action planning based on deep inverse reinforcement learning (基于深度逆向强化学习的行动序列规划问题研究). Cited by: 7

Abstract: Deep reinforcement learning relies heavily on the reward function when solving sequential decision-making tasks, yet the reward function often provides only sparse and delayed feedback. To address this, the paper proposes a method for generating and optimizing courses of action based on deep inverse reinforcement learning: the reward function is reconstructed from expert demonstration trajectories, so that the implicit expert experience contained in high-quality demonstration data can be extracted and exploited and the regularities behind the data uncovered. The reconstructed reward function is then combined with the environment's intrinsic reward through reward shaping; the resulting reward function gives the agent more timely and accurate feedback on its behaviour and greatly accelerates the convergence of reinforcement learning.
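The following is a minimal, illustrative Python sketch (not the authors' implementation) of the reward-shaping idea described in the abstract: a reward model recovered from expert demonstration trajectories via inverse reinforcement learning is blended with the environment's own sparse reward to give the agent denser feedback. The names (RewardModel, shaped_reward, alpha) and the linear feature-based reward are simplifying assumptions; the paper itself uses a deep (neural-network) reward model.

import numpy as np

# A toy linear reward model standing in for the deep IRL reward network.
class RewardModel:
    """Reward r_hat(s) = w . phi(s), fitted from expert demonstration features."""

    def __init__(self, feature_dim):
        self.w = np.zeros(feature_dim)

    def fit(self, expert_features, baseline_features, lr=0.1, steps=100):
        # Apprenticeship-style update: move the weights toward the expert
        # feature expectation and away from that of baseline (random) behaviour.
        mu_expert = expert_features.mean(axis=0)
        mu_baseline = baseline_features.mean(axis=0)
        for _ in range(steps):
            self.w += lr * (mu_expert - mu_baseline)
        self.w /= np.linalg.norm(self.w) + 1e-8

    def reward(self, features):
        return float(self.w @ features)

def shaped_reward(env_reward, state_features, model, alpha=0.5):
    """Reward shaping: blend the sparse environment reward with the dense learned reward."""
    return env_reward + alpha * model.reward(state_features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = RewardModel(feature_dim=4)
    # Synthetic stand-ins for state features drawn from expert and random trajectories.
    expert_feats = rng.normal(loc=1.0, size=(50, 4))
    random_feats = rng.normal(loc=0.0, size=(50, 4))
    model.fit(expert_feats, random_feats)
    print(shaped_reward(env_reward=0.0, state_features=expert_feats[0], model=model))

In this sketch the shaped term supplies the dense signal recovered from expert course-of-action traces while the environment term preserves the original task objective; according to the abstract, it is this combination that accelerates convergence.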
Authors: 陈希亮 (CHEN Xiliang), 曹雷 (CAO Lei), 沈驰 (SHEN Chi) (College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China; The 28th Research Institute of China Electronic Science and Technology Group Corporation, Nanjing 210007, China)
Source: National Defense Technology (《国防科技》), 2019, No. 4, pp. 55-61 (7 pages)
Keywords: deep reinforcement learning; course of action planning; smart warfare
Classification: E917 [Military Science]

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部