
Twin delayed deep deterministic policy gradient based on optimistic exploration
Abstract: Twin delayed deep deterministic policy gradient (TD3) is a mainstream model-free deep reinforcement learning algorithm that has been successfully applied to challenging continuous control tasks. However, when rewards are sparse or the state space is large, TD3 suffers from poor sample efficiency and weak exploration of the environment. To address the inefficient exploration caused by taking the lower bound of the twin Q-value functions as the target, a twin delayed deep deterministic policy gradient based on optimistic exploration (TD3-OE) is proposed. First, starting from the twin Q-value functions, it is shown that taking the lower bound makes exploration somewhat pessimistic. Then, a Gaussian function and a piecewise function are used to fit the twin Q-value functions. Finally, an exploration policy is constructed from the fitted Q-value function and the target policy to guide the agent's exploration of the environment. The exploration policy prevents the agent from converging to sub-optimal policies, effectively alleviating the problem of inefficient exploration. The proposed algorithm is compared with benchmark algorithms on control tasks built on the MuJoCo physics engine to verify its effectiveness. Experimental results show that the proposed algorithm matches or exceeds the baseline reinforcement learning algorithms in reward, stability, and learning speed.
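The contrast the abstract draws between TD3's pessimistic lower-bound target and an optimistic exploration estimate can be sketched as follows. This is a minimal illustration, not the paper's method: TD3-OE fits the twin Q-value functions with a Gaussian function and a piecewise function, whereas here the optimism bonus is simply proportional to critic disagreement, and `beta` is a hypothetical exploration coefficient.

```python
def pessimistic_target(q1, q2):
    """TD3's clipped double-Q target: the element-wise lower bound
    of the two critics' value estimates (pessimistic)."""
    return [min(a, b) for a, b in zip(q1, q2)]

def optimistic_estimate(q1, q2, beta=1.0):
    """Optimistic value estimate: the critic mean plus a bonus
    proportional to critic disagreement. With beta=1 this equals the
    element-wise upper bound max(q1, q2); beta is a hypothetical
    exploration coefficient, not a parameter from the paper."""
    return [0.5 * (a + b) + beta * 0.5 * abs(a - b)
            for a, b in zip(q1, q2)]
```

Acting greedily with respect to the pessimistic estimate discourages visiting states where the critics disagree, which is exactly where more data is needed; an optimistic estimate instead rewards such disagreement, steering the exploration policy toward under-explored regions.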
Authors: Wang Haoyu; Zhang Hengbo; Cheng Yuhu; Wang Xuesong (School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China)
Source: Journal of Nanjing University of Science and Technology, 2024, No. 3, pp. 300-309 (10 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61976215, 62176259); Natural Science Foundation of Jiangsu Province (BK20221116); Jiangsu Province Excellent Postdoctoral Program (2022ZB530).
Keywords: deep reinforcement learning; twin delayed deep deterministic policy gradient; exploration policy; optimistic exploration