Abstract
In a model-unknown partially observable Markov decision process (POMDP), the agent cannot directly access the true state of the environment, and this perceptual uncertainty poses challenges for learning an optimal policy. To address this, a double deep Q-network reinforcement learning algorithm incorporating contrastive predictive coding representations is proposed, in which belief states are modeled explicitly to obtain a compact and efficient encoding of the history for policy optimization. To improve data efficiency, a belief replay buffer is introduced, which stores belief transition pairs directly instead of observation and action sequences, thereby reducing memory usage. In addition, a phased training strategy is designed that decouples representation learning from policy learning to improve training stability. POMDP navigation tasks are designed on the Gym-MiniGrid environment, and experimental results show that the proposed algorithm captures state-relevant semantic information and thus achieves stable and efficient policy learning under partial observability.
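The belief replay buffer is the abstract's key data-efficiency device: each stored transition holds fixed-size belief vectors rather than a growing observation-action history. The following is a minimal Python sketch of that idea under assumptions of our own; the class name BeliefReplayBuffer, its method signatures, and the (belief, action, reward, next_belief, done) tuple layout are illustrative and not taken from the paper.

```python
import random
from collections import deque

import numpy as np


class BeliefReplayBuffer:
    """Stores belief transitions (b, a, r, b', done) for off-policy updates.

    Keeping only the fixed-size belief vectors produced by the history
    encoder, rather than full observation-action sequences, makes the
    memory cost of each transition independent of episode length.
    """

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, belief, action, reward, next_belief, done):
        self.buffer.append((belief, action, reward, next_belief, done))

    def sample(self, batch_size):
        # Uniform random minibatch, returned as stacked arrays ready
        # for a Q-network update.
        batch = random.sample(self.buffer, batch_size)
        beliefs, actions, rewards, next_beliefs, dones = zip(*batch)
        return (np.stack(beliefs),
                np.asarray(actions, dtype=np.int64),
                np.asarray(rewards, dtype=np.float32),
                np.stack(next_beliefs),
                np.asarray(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)


# Usage sketch: belief vectors would come from the CPC-based history
# encoder (frozen after the representation-learning phase); sampled
# batches then feed standard double-DQN updates.
buf = BeliefReplayBuffer(capacity=100_000)
b, b_next = np.zeros(128, dtype=np.float32), np.ones(128, dtype=np.float32)
buf.push(b, 2, 0.0, b_next, False)
```

This mirrors the phased training strategy described above: representation learning first produces the belief encoder, and policy learning then runs entirely on compact belief transitions.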
Authors
LIU Jianfeng; PU Jiexin; SUN Lifan (School of Information Engineering, Henan University of Science and Technology, Luoyang, Henan 471023, China)
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2023, No. 6, pp. 162-170 (9 pages)
Funding
Pre-research Foundation of National Ministries of China (61403120207)
Program for Science and Technology Innovation Talents in Universities of Henan Province (21HASTIT030)
Training Program for Young Backbone Teachers in Colleges and Universities of Henan Province (2020GGJS073)
Keywords
partially observable Markov decision process(POMDP)
representation learning
reinforcement learning
contrastive predictive coding
double deep Q-network