
An Actor-Critic Learning Approach Based on Joint Neural Network in Continuous Space
Abstract: In complex continuous-space application scenarios, classical discrete-space reinforcement learning methods can no longer meet practical needs, while existing continuous-space reinforcement learning methods mainly rely on linear fitting to approximate the state value function and the action selection function, which limits their accuracy. This paper proposes a nonlinear actor-critic approach based on a joint (union) neural network (UNN-AC). The method expresses the action selection function and the critic value function as a single joint neural network model and uses this network to fit the state value function and the action selection probability nonlinearly. Compared with existing linear fitting methods, the nonlinear UNN-AC improves the fitting accuracy of both the critic value function and the action selection function. Experimental results show that the UNN-AC algorithm can effectively solve the approximate optimal policy problem in continuous space, and that, compared with classical continuous-action-space algorithms, it converges faster and is more stable.
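To make the abstract's idea concrete, below is a minimal sketch of a joint (shared-trunk) actor-critic network for a continuous action space, trained with a one-step TD advantage update. It only illustrates the general technique the abstract describes; the layer sizes, Gaussian policy head, optimizer, and update rule are assumptions for illustration, not the authors' UNN-AC implementation.

```python
# Sketch: one network ("joint neural network") produces both the continuous
# action distribution (actor) and the state value (critic).
# All hyperparameters and names are illustrative assumptions.
import torch
import torch.nn as nn


class JointActorCritic(nn.Module):
    """Shared nonlinear trunk with an actor head and a critic head."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Shared nonlinear feature extractor used by both heads.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)              # actor: action mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # actor: log std dev
        self.value = nn.Linear(hidden, 1)                      # critic: V(s)

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return dist, self.value(h).squeeze(-1)


def td_actor_critic_step(net, opt, s, a, r, s_next, done, gamma=0.99):
    """One-step TD(0) advantage actor-critic update on a single transition."""
    dist, v = net(s)
    with torch.no_grad():
        _, v_next = net(s_next)
        target = r + gamma * (1.0 - done) * v_next
    advantage = (target - v).detach()
    critic_loss = (target - v).pow(2)                    # fit V(s) to the TD target
    actor_loss = -dist.log_prob(a).sum(-1) * advantage   # policy-gradient term
    loss = (actor_loss + 0.5 * critic_loss).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Hypothetical usage on a 3-dimensional state and 1-dimensional action:
# net = JointActorCritic(state_dim=3, action_dim=1)
# opt = torch.optim.Adam(net.parameters(), lr=3e-4)
```

Because the two heads share one trunk, gradients from the critic loss also shape the features used by the policy, which is one plausible reading of why a joint nonlinear model can fit both the value function and the action selection probability more accurately than separate linear approximators.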
Authors: YANG Jinhong; TAN Bin; HUANGFU Li; XIONG Zhang (Systems Engineering Research Institute of CSSC, Beijing 100094, China; College of Computer Science & Technology, Beihang University, Beijing 100192, China)
Source: 《智能安全》 (Intelligent Security), 2022, No. 2, pp. 19-25 (7 pages)
Keywords: joint neural network; continuous space; actor-critic; nonlinear
