Efficient exploration with stochastic policy for deep reinforcement learning
Abstract: Deep reinforcement learning algorithms can now solve many complex tasks, but balancing exploration and exploitation remains a fundamental problem in reinforcement learning. This paper therefore proposes an efficient exploration method for deep reinforcement learning that incorporates a stochastic policy. Exploiting the inherent exploration ability of stochastic policies, the method uses experience samples generated by a stochastic policy to train a deterministic policy, encouraging the deterministic policy to learn to explore while retaining its own advantages. Combining the deterministic policy algorithm DDPG (Deep Deterministic Policy Gradient) with the proposed exploration method yields the Stochastic Guidance for Deterministic Policy Gradient (SGDPG) algorithm. Experiments in several complex environments show that, when faced with deep exploration problems, SGDPG achieves higher exploration efficiency and sample efficiency than DDPG.
Authors: Yang Shangtong; Wang Zilei (School of Cyberspace Security, University of Science and Technology of China, Hefei 230027, China)
Source: Information Technology and Network Security, 2021, No. 6, pp. 43-49.
Funding: National Natural Science Foundation of China (61836008, 61673362).
Keywords: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma
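
To make the mechanism described in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the core idea: experience is collected by a stochastic "guide" policy, and a deterministic DDPG-style actor-critic is trained off-policy on those samples. The toy environment, network sizes, and hyper-parameters (ToyEnv, guide_std, learning rates) are illustrative assumptions, not details from the paper; in particular, the abstract does not describe how SGDPG updates the stochastic guide, so the guide update at the end is only a placeholder.

import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ToyEnv:
    """Assumed 1-D point mass: reward is highest when the state is at the origin."""
    def reset(self):
        self.s = torch.randn(1)
        return self.s

    def step(self, a):
        self.s = self.s + 0.1 * a
        return self.s, -self.s.pow(2).item()

# Deterministic actor (the policy we ultimately want), a stochastic guide
# policy used only to collect experience, and a shared Q-critic.
actor = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
guide = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
guide_opt = optim.Adam(guide.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

buffer = deque(maxlen=10_000)
gamma, guide_std = 0.99, 0.3  # assumed hyper-parameters

env = ToyEnv()
s = env.reset()
for step in range(2000):
    # Experience comes from the stochastic guide, not the deterministic
    # actor: sample a Gaussian around the guide's mean action.
    with torch.no_grad():
        a = (guide(s) + guide_std * torch.randn(1)).clamp(-1.0, 1.0)
    s2, r = env.step(a)
    buffer.append((s, a, r, s2))
    s = s2

    if len(buffer) < 64:
        continue
    batch = random.sample(buffer, 64)
    bs = torch.stack([t[0] for t in batch])
    ba = torch.stack([t[1] for t in batch])
    br = torch.tensor([t[2] for t in batch]).unsqueeze(1)
    bs2 = torch.stack([t[3] for t in batch])

    # Critic: one-step TD target, bootstrapping with the deterministic actor.
    with torch.no_grad():
        target = br + gamma * critic(torch.cat([bs2, actor(bs2)], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([bs, ba], dim=1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic actor: the standard deterministic policy gradient
    # through the critic, trained off-policy on the guide's samples.
    actor_loss = -critic(torch.cat([bs, actor(bs)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Placeholder guide update (not specified by the abstract): push the
    # guide's mean toward high-Q actions so its exploration stays useful.
    guide_loss = -critic(torch.cat([bs, guide(bs)], dim=1)).mean()
    guide_opt.zero_grad(); guide_loss.backward(); guide_opt.step()

The essential difference from plain DDPG in this sketch is where the replay buffer comes from: it is filled by the stochastic guide rather than by the deterministic actor plus hand-tuned action noise, which is the "stochastic guidance" the abstract describes.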