

Actor-Critic Algorithm with Maximum-Entropy Correction
Abstract: In recent years, Deep Reinforcement Learning (DRL), which combines deep learning with reinforcement learning, has become one of the research hotspots in artificial intelligence, achieving impressive results in robot control, Atari 2600 games and other domains. The first algorithm to combine reinforcement learning with deep learning was TD-Gammon, which surpassed professional human players at backgammon. It was followed by a series of algorithms derived from Deep Q-Network (DQN), which achieved extraordinary performance on Atari 2600 games. Limited by its value-function formulation, however, DQN cannot be applied to continuous-action tasks. The Actor-Critic (AC) algorithm is therefore widely used in DRL because of its excellent scalability. It consists of two parts, the actor and the critic: the critic usually evaluates actions with value-function methods, while the actor generates actions with policy-gradient methods, so AC methods can be applied to both continuous-action and discrete-action tasks. Researchers have proposed many AC algorithms, such as Asynchronous Advantage Actor-Critic (A3C) with asynchronous updates, Advantage Actor-Critic (A2C) with synchronous multi-agent updates, and Proximal Policy Optimization (PPO) with guaranteed policy improvement. One of the core challenges in reinforcement learning (RL) is how to balance exploration and exploitation. To ensure a high return, the algorithm needs to exploit the actions that past experience suggests have high expected returns; yet some actions yield high immediate rewards but low expected returns, so the agent must also explore untried actions to keep the policy from falling into a local optimum. In other words, the agent must use previous experience to obtain a higher expected return, while at the same time exploring to find better actions. In the policy gradient, a maximum-entropy regularization term is usually added to increase the randomness of the policy and thereby ensure exploration. This randomness enables the agent to traverse all actions, but it causes underestimation of the value function and harms the convergence speed and stability of the algorithm. To solve the underestimation problem caused by the maximum-entropy regularization term in the policy gradient, the Maximum-Entropy Correction (MEC) algorithm is proposed. The algorithm has two characteristics: (1) an estimate of the state-action value function is constructed from the state value function and the policy function, and the constructed state-action value function conforms to the distribution of the true value function; (2) the Bellman optimality equation is combined with the constructed state-action value function to form the objective function of the MEC algorithm. By using this new objective function, the MEC algorithm resolves the performance degradation and instability caused by the maximum-entropy regularization term. To verify its effectiveness, the algorithm was compared with PPO and A2C on seven Atari 2600 games: BeamRider, Breakout, Enduro, Pong, Qbert, Seaquest and SpaceInvaders. Experimental results show that MEC improves the stability of the algorithm while improving performance.
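The two characteristics described above can be made concrete with a short sketch. The PyTorch snippet below is only one illustrative reading of the abstract, not the authors' implementation: the names v_net and policy_net, the temperature tau, and the specific construction Q̂(s,a) = V(s) + τ·log π(a|s) are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the MEC idea for discrete actions, assuming
# Q_hat(s, a) = V(s) + tau * log pi(a|s); names and hyperparameters are illustrative.
def mec_value_loss(v_net, policy_net, batch, gamma=0.99, tau=0.01):
    s, r, s_next, done = batch  # states, rewards, next states, terminal flags

    with torch.no_grad():
        log_pi_next = F.log_softmax(policy_net(s_next), dim=-1)   # log pi(a'|s')
        v_next = v_net(s_next).squeeze(-1)                        # V(s')
        # Characteristic (1): construct Q_hat(s', a') from V(s') and the policy.
        q_hat_next = v_next.unsqueeze(-1) + tau * log_pi_next
        # Characteristic (2): Bellman-optimality-style target built from the constructed Q_hat.
        target = r + gamma * (1.0 - done) * q_hat_next.max(dim=-1).values

    v = v_net(s).squeeze(-1)                                      # V(s) regressed toward the target
    return F.mse_loss(v, target)
```

In this reading, the greedy max over the constructed Q̂ replaces the entropy-regularized backup, which mirrors the abstract's claim that the new objective counteracts the underestimation introduced by the maximum-entropy term.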
Authors: JIANG Yu-Bin, LIU Quan, HU Zhi-Hui (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000)
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI and CSCD, Peking University Core Journal, 2020, No. 10, pp. 1897-1908 (12 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61472262, 61502323, 61502329); Major Projects of Natural Science Research in Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.
Keywords: reinforcement learning; deep learning; actor-critic algorithm; maximum entropy; policy gradient