Efficient exploration with stochastic policy for deep reinforcement learning
Abstract: Deep reinforcement learning algorithms can now solve many complex tasks, but balancing exploration and exploitation remains a fundamental problem in reinforcement learning. This paper therefore proposes an efficient exploration method for deep reinforcement learning that incorporates a stochastic policy. Exploiting the inherent exploration ability of stochastic policies, the method uses experience samples generated by a stochastic policy to train a deterministic policy, encouraging the deterministic policy to learn to explore while retaining its own advantages. Combining the deterministic policy algorithm DDPG (Deep Deterministic Policy Gradient) with the proposed exploration method yields the Stochastic Guidance for Deterministic Policy Gradient (SGDPG) algorithm. Experiments in several complex environments show that, when faced with deep exploration problems, SGDPG achieves higher exploration efficiency and sample efficiency than DDPG.
Authors: Yang Shangtong; Wang Zilei (School of Cyberspace Security, University of Science and Technology of China, Hefei 230027, China)
Source: Information Technology and Network Security, 2021, No. 6, pp. 43-49.
Funding: National Natural Science Foundation of China (61836008, 61673362).
Keywords: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma
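
To make the mechanism described in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the core idea: experience is collected by a stochastic "guide" policy, and a deterministic DDPG-style actor-critic is trained off-policy on those samples. The toy environment, network sizes, and hyper-parameters (ToyEnv, guide_std, learning rates) are illustrative assumptions, not details from the paper; in particular, the abstract does not describe how SGDPG updates the stochastic guide, so the guide update at the end is only a placeholder.

import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ToyEnv:
    """Assumed 1-D point mass: reward is highest when the state is at the origin."""
    def reset(self):
        self.s = torch.randn(1)
        return self.s

    def step(self, a):
        self.s = self.s + 0.1 * a
        return self.s, -self.s.pow(2).item()

# Deterministic actor (the policy we ultimately want), a stochastic guide
# policy used only to collect experience, and a shared Q-critic.
actor = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
guide = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
guide_opt = optim.Adam(guide.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

buffer = deque(maxlen=10_000)
gamma, guide_std = 0.99, 0.3  # assumed hyper-parameters

env = ToyEnv()
s = env.reset()
for step in range(2000):
    # Experience comes from the stochastic guide, not the deterministic
    # actor: sample a Gaussian around the guide's mean action.
    with torch.no_grad():
        a = (guide(s) + guide_std * torch.randn(1)).clamp(-1.0, 1.0)
    s2, r = env.step(a)
    buffer.append((s, a, r, s2))
    s = s2

    if len(buffer) < 64:
        continue
    batch = random.sample(buffer, 64)
    bs = torch.stack([t[0] for t in batch])
    ba = torch.stack([t[1] for t in batch])
    br = torch.tensor([t[2] for t in batch]).unsqueeze(1)
    bs2 = torch.stack([t[3] for t in batch])

    # Critic: one-step TD target, bootstrapping with the deterministic actor.
    with torch.no_grad():
        target = br + gamma * critic(torch.cat([bs2, actor(bs2)], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([bs, ba], dim=1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic actor: the standard deterministic policy gradient
    # through the critic, trained off-policy on the guide's samples.
    actor_loss = -critic(torch.cat([bs, actor(bs)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Placeholder guide update (not specified by the abstract): push the
    # guide's mean toward high-Q actions so its exploration stays useful.
    guide_loss = -critic(torch.cat([bs, guide(bs)], dim=1)).mean()
    guide_opt.zero_grad(); guide_loss.backward(); guide_opt.step()

The essential difference from plain DDPG in this sketch is where the replay buffer comes from: it is filled by the stochastic guide rather than by the deterministic actor plus hand-tuned action noise, which is the "stochastic guidance" the abstract describes.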