
Least-Squares Based Q(λ) Algorithm for Reinforcement Learning
Abstract: The classical Q(λ) learning algorithm suffers from slow convergence and low efficiency of experience exploitation. To address this, a least-squares approximation model of the state-action value function is constructed from current and previous (multi-step) experience samples, and a set of linear equations satisfied by the weight vector of the approximator on a set of basis functions is derived. From this, a fast and practical least-squares Q(λ) algorithm and an improved recursive variant are proposed. Inverted-pendulum experiments demonstrate that these algorithms effectively improve convergence speed and the efficiency of experience exploitation.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2008, No. 34, pp. 47-50.
Funding: Jiangsu Province University Natural Science Basic Research Project No. 07KJD520092.
Keywords: reinforcement learning; Q(λ) learning; function approximation; least squares; inverted pendulum
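The approach summarized in the abstract can be sketched in code. This is a minimal illustration only, assuming a linear architecture Q(s, a) = wᵀφ(s, a) and an LSTD-Q-style least-squares solve over a batch of samples, with a Sherman-Morrison rank-one update for the recursive variant; the function names, the regularization term, and the exact update form are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def lstsq_q_weights(phi, phi_next, rewards, gamma=0.9, reg=1e-6):
    """Batch least-squares solve for the weights of a linear
    Q-function approximator Q(s, a) = w . phi(s, a).

    phi      : (T, k) basis features of the visited state-action pairs
    phi_next : (T, k) features of the successor (greedy) pairs
    rewards  : (T,)   observed rewards

    Solves the linear system A w = b with
        A = Phi^T (Phi - gamma * Phi') + reg * I,   b = Phi^T r,
    the standard LSTD-Q normal equations (reg is a small ridge term
    added here for numerical stability).
    """
    k = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(k)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

def rls_update(A_inv, b, phi_t, phi_next_t, r_t, gamma=0.9):
    """One recursive step: fold a single new sample into the cached
    inverse of A via the Sherman-Morrison identity, avoiding a full
    O(k^3) re-solve. Returns the updated (A_inv, b, w)."""
    u = phi_t                        # rank-1 update: A += u v^T
    v = phi_t - gamma * phi_next_t
    Au = A_inv @ u
    vA = v @ A_inv
    A_inv = A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
    b = b + r_t * phi_t
    return A_inv, b, A_inv @ b
```

The recursive form updates the cached inverse in O(k²) per sample instead of re-solving a k×k system from scratch, which is the usual motivation for a recursive least-squares variant such as the one the abstract proposes.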

References

  • 1 Watkins C J C H, Dayan P. Q-learning[J]. Machine Learning, 1992, 8(1): 279-292.
  • 2 Sutton R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3: 9-44.
  • 3 Xu Xin, He Hangen. A gradient algorithm for reinforcement learning based on neural networks[J]. Chinese Journal of Computers, 2003, 26(2): 227-233.
  • 4 Barreto A d M S, Anderson C W. Restricted gradient-descent algorithm for value-function approximation in reinforcement learning[J]. Artificial Intelligence, 2008: 454-482.
  • 5 Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: A survey[J]. Journal of Artificial Intelligence Research, 1996, 4: 237-285.
  • 6 Rezzoug N, Gorce P. A reinforcement learning based neural network architecture for obstacle avoidance in multi-fingered grasp synthesis[J]. Neurocomputing, 2008, 26(1).
  • 7 Erden M S, Leblebicioglu K. Free gait generation with reinforcement learning for a six-legged robot[J]. Robotics and Autonomous Systems, 2008: 199-212.
  • 8 Peng J, Williams R J. Incremental multi-step Q-learning[J]. Machine Learning, 1996, 22(4): 283-290.
  • 9 Sutton R S, Barto A G. Reinforcement Learning: An Introduction[M]. Cambridge, MA: MIT Press, 1998.
  • 10 Lagoudakis M G, Parr R. Least-squares policy iteration[J]. Journal of Machine Learning Research, 2003, 4: 1107-1149.
