
Function approximation method based on weights gradient descent in reinforcement learning
Abstract  Function approximation is a research focus in reinforcement learning, as it can effectively handle problems with large-scale, continuous state and action spaces. Although gradient-descent-based function approximation is one of the most widely used methods in reinforcement learning, it places high demands on the step-size parameter: an inappropriate value can lead to slow convergence, unstable convergence, or even divergence. To address these issues, the weight-update rule of the temporal-difference (TD) algorithm with function approximation was improved by building on the least-squares method and gradient descent. The least-squares method was applied to the value function to solve for a set of weights, the ideas of TD and gradient descent were combined to compute the error between those weights and the current weights, and this error was used to update the weights directly, yielding the proposed weights gradient descent (WGD) method. WGD updates the weights in a new manner, effectively reduces the algorithm's consumption of computing resources, can be used to improve other gradient-descent-based function approximation algorithms, and is therefore applicable to a wide range of gradient-descent-based reinforcement learning algorithms. Experiments show that WGD can adjust parameters within a wider space, effectively reduces the possibility of divergence, and improves convergence speed while maintaining good convergence.
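
The update rule described in the abstract can be sketched in code. The following is a minimal illustrative sketch only, assuming linear value approximation V(s) = w·φ(s) and an LSTD-style least-squares solve; the function names, parameters, and the exact form of the weight error are assumptions made for illustration and are not taken from the paper.

import numpy as np

# Hypothetical sketch of the idea described in the abstract: solve for a
# target weight vector by least squares (LSTD-style), then update the
# current weights directly along the error between the two weight vectors.

def lstd_weights(phi, phi_next, rewards, gamma=0.99, reg=1e-3):
    # phi, phi_next: (T, d) feature matrices for s_t and s_{t+1}; rewards: (T,)
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(d)  # regularized LSTD matrix
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

def wgd_update(w, w_ls, alpha=0.1):
    # Error between the least-squares weights and the current weights,
    # used to update the weights directly (per the abstract's description).
    return w + alpha * (w_ls - w)

# Toy usage with random data, purely illustrative.
rng = np.random.default_rng(0)
T, d = 200, 8
phi = rng.normal(size=(T, d))
phi_next = rng.normal(size=(T, d))
r = rng.normal(size=T)
w = np.zeros(d)
w_ls = lstd_weights(phi, phi_next, r)
for _ in range(50):
    w = wgd_update(w, w_ls)

Read this way, the step size acts on the error between weight vectors rather than on per-sample TD errors, which would be consistent with the abstract's claim that parameters can be tuned in a wider space with a lower risk of divergence.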
Authors  秦晓燕 (QIN Xiaoyan), 刘禹含 (LIU Yuhan), 徐云龙 (XU Yunlong), 李斌 (LI Bin) — School of Information and Software, Global Institute of Software Technology, Suzhou 215163, China; University of Waterloo, Waterloo N2L 3G4, Canada; Applied Technology College, Soochow University, Suzhou 215325, China; School of Computer Science and Technology, Soochow University, Suzhou 215325, China
Source  Chinese Journal of Network and Information Security (《网络与信息安全学报》), 2023, No. 4, pp. 16-28 (13 pages)
Funding  National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175); Major Natural Science Research Project of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Jiangsu High-End Training Program for Professional Leaders of Higher Vocational College Teachers (2021GRFX052); Priority Academic Program Development of Jiangsu Higher Education Institutions; Jiangsu Vocational Education Software Technology "Double-Qualified" Famous Teacher Studio Project
Keywords  function approximation; reinforcement learning; gradient descent; least squares; weights gradient descent
