

Actor-Critic Algorithm with Maximum-Entropy Correction
Abstract: In recent years, Deep Reinforcement Learning (DRL), which combines deep learning with reinforcement learning, has become one of the research hotspots in artificial intelligence, achieving impressive results in robot control, Atari 2600 games and other domains. The first algorithm to combine reinforcement learning with deep learning was TD-Gammon, which surpassed professional human players at backgammon. It was followed by a series of algorithms derived from Deep Q-Network (DQN), which achieved extraordinary performance on Atari 2600 games. Limited by its value-function formulation, however, DQN cannot be applied to continuous-action tasks. The Actor-Critic (AC) algorithm is therefore widely used in DRL because of its excellent scalability. It consists of two parts, the actor and the critic: the critic usually evaluates actions with value-function methods, while the actor generates actions with policy-gradient methods, so AC methods can be applied to both continuous-action and discrete-action tasks. Researchers have proposed many AC algorithms, such as Asynchronous Advantage Actor-Critic (A3C) with asynchronous updates, Advantage Actor-Critic (A2C) with synchronous multi-agent updates, and Proximal Policy Optimization (PPO) with guaranteed policy improvement. One of the core challenges in reinforcement learning (RL) is how to balance exploration and exploitation. To ensure a high return, the algorithm needs to exploit the actions that past experience suggests have high expected returns; yet some actions yield high immediate rewards but low expected returns, so the agent must also explore untried actions to keep the policy from falling into a local optimum. In other words, the agent must use previous experience to obtain a higher expected return, while at the same time exploring to find better actions. In the policy gradient, a maximum-entropy regularization term is usually added to increase the randomness of the policy and thereby ensure exploration. This randomness enables the agent to traverse all actions, but it causes underestimation of the value function and harms the convergence speed and stability of the algorithm. To solve the underestimation problem caused by the maximum-entropy regularization term in the policy gradient, the Maximum-Entropy Correction (MEC) algorithm is proposed. The algorithm has two characteristics: (1) an estimate of the state-action value function is constructed from the state value function and the policy function, and the constructed state-action value function conforms to the distribution of the true value function; (2) the Bellman optimality equation is combined with the constructed state-action value function to form the objective function of the MEC algorithm. By using this new objective function, the MEC algorithm resolves the performance degradation and instability caused by the maximum-entropy regularization term. To verify its effectiveness, the algorithm was compared with PPO and A2C on seven Atari 2600 games: BeamRider, Breakout, Enduro, Pong, Qbert, Seaquest and SpaceInvaders. Experimental results show that MEC improves the stability of the algorithm while improving performance.
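The two characteristics described above can be made concrete with a short sketch. The PyTorch snippet below is only one illustrative reading of the abstract, not the authors' implementation: the names v_net and policy_net, the temperature tau, and the specific construction Q̂(s,a) = V(s) + τ·log π(a|s) are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the MEC idea for discrete actions, assuming
# Q_hat(s, a) = V(s) + tau * log pi(a|s); names and hyperparameters are illustrative.
def mec_value_loss(v_net, policy_net, batch, gamma=0.99, tau=0.01):
    s, r, s_next, done = batch  # states, rewards, next states, terminal flags

    with torch.no_grad():
        log_pi_next = F.log_softmax(policy_net(s_next), dim=-1)   # log pi(a'|s')
        v_next = v_net(s_next).squeeze(-1)                        # V(s')
        # Characteristic (1): construct Q_hat(s', a') from V(s') and the policy.
        q_hat_next = v_next.unsqueeze(-1) + tau * log_pi_next
        # Characteristic (2): Bellman-optimality-style target built from the constructed Q_hat.
        target = r + gamma * (1.0 - done) * q_hat_next.max(dim=-1).values

    v = v_net(s).squeeze(-1)                                      # V(s) regressed toward the target
    return F.mse_loss(v, target)
```

In this reading, the greedy max over the constructed Q̂ replaces the entropy-regularized backup, which mirrors the abstract's claim that the new objective counteracts the underestimation introduced by the maximum-entropy term.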
Authors: JIANG Yu-Bin, LIU Quan, HU Zhi-Hui (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000)
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI and CSCD, Peking University Core Journal, 2020, No. 10, pp. 1897-1908 (12 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61472262, 61502323, 61502329); Major Projects of Natural Science Research in Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Priority Academic Program Development of Jiangsu Higher Education Institutions.
Keywords: reinforcement learning; deep learning; actor-critic algorithm; maximum entropy; policy gradient