Double Deep Q-Network by Fusing Contrastive Predictive Coding
Abstract: In a partially observable Markov decision process (POMDP) whose model is unknown, the agent cannot directly access the true state of the environment, and this perceptual uncertainty makes learning an optimal policy challenging. To address this, a double deep Q-network reinforcement learning algorithm that fuses contrastive predictive coding representations is proposed. Belief states are modeled explicitly to obtain a compact, efficient encoding of the history for use in policy optimization. To improve data efficiency, a belief replay buffer is introduced, which stores belief transition pairs directly instead of observation and action sequences, thereby reducing memory usage. In addition, a phased training strategy is designed to decouple representation learning from policy learning and improve training stability. POMDP navigation tasks are designed on the Gym-MiniGrid environment, and experimental results show that the proposed algorithm captures state-related semantic information, enabling stable and efficient policy learning under partial observability.
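The abstract only summarizes the approach; the following Python sketch (using PyTorch) illustrates how the described components might fit together: a recurrent belief encoder trained with a CPC-style InfoNCE objective, a belief replay buffer that stores belief transition tuples rather than raw observation-action sequences, and a double deep Q-network update acting on belief states. All class and function names below are hypothetical illustrations inferred from the abstract, not the authors' released code.

# Hypothetical sketch (not the authors' code): CPC-style belief encoder,
# belief replay buffer, and Double-DQN update on belief states.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class BeliefEncoder(nn.Module):
    """GRU encoder: maps an (observation, action) history to a belief vector."""

    def __init__(self, obs_dim, act_dim, belief_dim=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, belief_dim, batch_first=True)
        self.project = nn.Linear(belief_dim, belief_dim)  # CPC prediction head

    def forward(self, obs_act_seq, h0=None):
        out, h = self.gru(obs_act_seq, h0)
        return out, h  # out[:, t] is the belief state after step t


def infonce_loss(encoder, beliefs, future_obs_emb):
    """CPC-style InfoNCE: each belief should score its own future observation
    embedding higher than the other samples in the batch (the negatives)."""
    pred = encoder.project(beliefs)                 # (B, D)
    logits = pred @ future_obs_emb.t()              # (B, B) similarity matrix
    labels = torch.arange(len(beliefs))             # positives on the diagonal
    return F.cross_entropy(logits, labels)


class BeliefReplayBuffer:
    """Stores (belief, action, reward, next_belief, done) tuples directly,
    avoiding storage of full observation/action sequences."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, b, a, r, b_next, done):
        self.buffer.append((b, a, r, b_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        b, a, r, b_next, d = zip(*batch)
        return (torch.stack(b), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(b_next), torch.tensor(d, dtype=torch.float32))


def double_dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """Standard Double-DQN target: the online network selects the next action,
    the target network evaluates it; belief vectors replace raw states."""
    b, a, r, b_next, done = batch
    q = q_net(b).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(b_next).argmax(dim=1, keepdim=True)
        q_next = target_net(b_next).gather(1, next_a).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Under this reading, the phased training strategy mentioned in the abstract would correspond to optimizing the InfoNCE objective first (or on a separate schedule) and running the Double-DQN update only on beliefs produced by the already-trained encoder.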
Authors: LIU Jianfeng; PU Jiexin; SUN Lifan (School of Information Engineering, Henan University of Science and Technology, Luoyang, Henan 471023, China)
Source: Computer Engineering and Applications (CSCD; Peking University core journal), 2023, No. 6, pp. 162-170 (9 pages)
Funding: National Ministries' Pre-research Foundation (61403120207); Scientific and Technological Innovation Talents Program for Universities of Henan Province (21HASTIT030); Young Backbone Teachers Program for Universities of Henan Province (2020GGJS073)
Keywords: partially observable Markov decision process (POMDP); representation learning; reinforcement learning; contrastive predictive coding; double deep Q-network