
Hierarchical Reinforcement Learning Method Based on Trajectory Information
Abstract  The option-based hierarchical reinforcement learning (O-HRL) algorithm has the property of temporal abstraction, which allows it to effectively handle complex problems that are difficult for standard reinforcement learning, such as long-horizon tasks and sparse rewards. Existing studies of O-HRL methods focus mainly on improving data efficiency, raising the agent's sampling efficiency and exploration ability so as to maximize its probability of obtaining high-quality experience. In terms of policy stability, however, the high-level policy guides low-level actions using only state information, so option information is underutilized and the low-level policy becomes unstable. To address this problem, a hierarchical reinforcement learning method based on trajectory information (THRL) is proposed. THRL uses different types of information from option trajectories to guide the selection of low-level actions, and generates inferred options from the resulting extended trajectory information. A discriminator then takes the inferred options and the original options as input and produces an internal reward, which makes the selected low-level actions more consistent with the current option policy and thus resolves the instability of the low-level policy. THRL and several state-of-the-art deep reinforcement learning algorithms are applied to MuJoCo environments; experimental results show that THRL achieves better stability and performance, verifying its effectiveness.
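To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch of the discriminator-based internal reward: a discriminator is trained to infer the option from an extended trajectory segment, and the low-level policy receives log p(option | trajectory) as an internal reward so that its actions stay consistent with the current option. The names (TrajectoryDiscriminator, internal_reward), network sizes, and the exact reward form are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of the discriminator-based internal reward described
# in the abstract; architecture and reward form are assumed, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryDiscriminator(nn.Module):
    """Infers which option generated an extended trajectory segment."""
    def __init__(self, traj_dim: int, num_options: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),  # logits over the option set
        )

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        return self.net(traj)

def internal_reward(disc: TrajectoryDiscriminator,
                    traj: torch.Tensor,    # (batch, traj_dim) extended trajectory features
                    option: torch.Tensor   # (batch,) original option indices
                    ) -> torch.Tensor:
    # Reward low-level actions that make the original option easy to infer:
    # log p(option | trajectory) under the discriminator.
    with torch.no_grad():
        log_probs = F.log_softmax(disc(traj), dim=-1)
    return log_probs.gather(-1, option.unsqueeze(-1)).squeeze(-1)

def discriminator_loss(disc: TrajectoryDiscriminator,
                       traj: torch.Tensor,
                       option: torch.Tensor) -> torch.Tensor:
    # Train the discriminator so that its inferred option matches the
    # original option (standard cross-entropy on the logits).
    return F.cross_entropy(disc(traj), option)

In a setup of this kind, the internal reward would typically be added (possibly scaled) to the environment reward for the low-level policy, while the discriminator is updated on batches of (trajectory, option) pairs collected during training.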
Authors  XU Yapeng, LIU Quan, LI Junwei (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China)
Source  Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2023, No. 12, pp. 314-321 (8 pages)
Funding  National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Keywords  Option; Hierarchical reinforcement learning; Trajectory information; Discriminator; Deep reinforcement learning