
Hierarchical Reinforcement Learning Method Based on Trajectory Information
Abstract  The option-based hierarchical reinforcement learning (O-HRL) algorithm has the property of temporal abstraction, which allows it to effectively handle complex problems that are difficult for standard reinforcement learning, such as long-horizon tasks and sparse rewards. Existing studies of O-HRL methods focus mainly on improving data efficiency, raising the agent's sampling efficiency and exploration ability so as to maximize its probability of obtaining high-quality experience. In terms of policy stability, however, the high-level policy guides low-level actions using only state information, so option information is underutilized and the low-level policy becomes unstable. To address this problem, a hierarchical reinforcement learning method based on trajectory information (THRL) is proposed. THRL uses different types of information from option trajectories to guide the selection of low-level actions, and generates inferred options from the resulting extended trajectory information. A discriminator then takes the inferred options and the original options as input and produces an internal reward, which makes the selected low-level actions more consistent with the current option policy and thus resolves the instability of the low-level policy. THRL and several state-of-the-art deep reinforcement learning algorithms are applied to MuJoCo environments; experimental results show that THRL achieves better stability and performance, verifying its effectiveness.
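To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch of the discriminator-based internal reward: a discriminator is trained to infer the option from an extended trajectory segment, and the low-level policy receives log p(option | trajectory) as an internal reward so that its actions stay consistent with the current option. The names (TrajectoryDiscriminator, internal_reward), network sizes, and the exact reward form are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of the discriminator-based internal reward described
# in the abstract; architecture and reward form are assumed, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryDiscriminator(nn.Module):
    """Infers which option generated an extended trajectory segment."""
    def __init__(self, traj_dim: int, num_options: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),  # logits over the option set
        )

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        return self.net(traj)

def internal_reward(disc: TrajectoryDiscriminator,
                    traj: torch.Tensor,    # (batch, traj_dim) extended trajectory features
                    option: torch.Tensor   # (batch,) original option indices
                    ) -> torch.Tensor:
    # Reward low-level actions that make the original option easy to infer:
    # log p(option | trajectory) under the discriminator.
    with torch.no_grad():
        log_probs = F.log_softmax(disc(traj), dim=-1)
    return log_probs.gather(-1, option.unsqueeze(-1)).squeeze(-1)

def discriminator_loss(disc: TrajectoryDiscriminator,
                       traj: torch.Tensor,
                       option: torch.Tensor) -> torch.Tensor:
    # Train the discriminator so that its inferred option matches the
    # original option (standard cross-entropy on the logits).
    return F.cross_entropy(disc(traj), option)

In a setup of this kind, the internal reward would typically be added (possibly scaled) to the environment reward for the low-level policy, while the discriminator is updated on batches of (trajectory, option) pairs collected during training.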
Authors  XU Yapeng, LIU Quan, LI Junwei (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China)
Source  Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2023, No. 12, pp. 314-321 (8 pages)
Funding  National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Keywords  Option; Hierarchical reinforcement learning; Trajectory information; Discriminator; Deep reinforcement learning