Journal Article

An Undiscounted Reinforcement Learning Algorithm Based on Performance Potentials (Cited by: 2)
Abstract: Traditional performance-potential-based learning algorithms can obtain optimal policies for Markov decision problems (MDPs). These algorithms mainly rely on single-sample-path estimation, which limits their efficiency. Combining performance potentials with reinforcement learning, this paper proposes G-learning, an undiscounted value-iteration learning algorithm based on performance potentials, and compares it with the classic undiscounted reinforcement learning algorithm, R-learning, obtaining promising experimental results.
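For context, the classic R-learning baseline mentioned in the abstract keeps relative action values and a running estimate of the average reward, rather than discounting future rewards. A minimal sketch follows; the two-state MDP, step counts, and learning rates are illustrative assumptions, not details from the paper:

```python
import random

# Sketch of the classic R-learning baseline (average-reward RL; see
# Mahadevan 1996 in the references). The toy environment is hypothetical.

def r_learning(steps=20000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = random.Random(seed)

    def step(s, a):
        # Action 1 switches state; switching out of state 1 pays reward 1.
        if a == 1:
            return 1 - s, (1 if s == 1 else 0)
        return s, 0  # action 0 stays put with reward 0

    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    rho = 0.0  # running estimate of the average reward (gain)
    s = 0
    for _ in range(steps):
        greedy = max((0, 1), key=lambda a: Q[(s, a)])
        a = rng.choice((0, 1)) if rng.random() < eps else greedy
        s2, r = step(s, a)
        best_next = max(Q[(s2, 0)], Q[(s2, 1)])
        # Undiscounted relative update: immediate reward minus average reward
        Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
        # rho is adjusted only on greedy (non-exploratory) steps
        if a == greedy:
            rho += beta * (r + best_next - max(Q[(s, 0)], Q[(s, 1)]) - rho)
        s = s2
    return Q, rho
```

On this toy chain the optimal policy alternates between the two states and earns a reward every second step, so the average-reward estimate rho should approach a gain of about 0.5.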
Authors: ZHOU Ru-yi, GAO Yang
Source: Journal of Guangxi Normal University (Natural Science Edition), indexed by CAS and the Peking University Core list, 2006, No. 4, pp. 58-61 (4 pages)
Fund: Supported by the National Natural Science Foundation of China (Grant No. 60475026)
Keywords: reinforcement learning; performance potential; undiscounted; value iteration

References (9)

  • 1 MAHADEVAN S. Average reward reinforcement learning: foundations, algorithms and empirical results[J]. Machine Learning, 1996, 22: 159-195.
  • 2 SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, MA: MIT Press, 1998.
  • 3 CAO Xi-ren, CHEN Han-fu. Perturbation realization, potentials and sensitivity analysis of Markov processes[J]. IEEE Transactions on Automatic Control, 1997, 42: 1382-1393.
  • 4 CAO Xi-ren. The relation among potentials, perturbation analysis and Markov decision processes[J]. Journal of Discrete Event Dynamic Systems, 1998, 8: 71-87.
  • 5 CAO Xi-ren. Single sample path based optimization of Markov chains[J]. Journal of Optimization Theory and Applications, 1999, 100: 527-548.
  • 6 FANG Hai-tao, CAO Xi-ren. Potential-based on-line policy iteration algorithms for Markov decision processes[J]. IEEE Transactions on Automatic Control, 2004, 49: 493-505.
  • 7 FANG Hai-tao, CAO Xi-ren. Recursive approaches for single sample path based Markov reward processes[J]. Asian Journal of Control, 2001, 3: 21-26.
  • 8 CAO Xi-ren. From perturbation analysis to Markov decision processes and reinforcement learning[J]. Journal of Discrete Event Dynamic Systems, 2003, 13: 9-39.
  • 9 MARBACH P, TSITSIKLIS J N. Simulation-based optimization of Markov reward processes[R]. Cambridge, MA: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1998.

Co-cited References (11)

  • 1 JIANG Wei-jin, XU Yu-sheng, WU Quan-yuan, SUN Xing-ming. Research on a multi-agent based distributed intelligent diagnosis method[J]. Acta Electronica Sinica, 2004, 32(F12): 235-237. (Cited 8 times)
  • 2 GAO Yang, ZHOU Ru-yi, WANG Hao, CAO Zhi-xin. Research on average reward reinforcement learning algorithms[J]. Chinese Journal of Computers, 2007, 30(8): 1372-1378. (Cited 38 times)
  • 3 Szepesvari C. Algorithms for reinforcement learning: Synthesis lectures on artificial intelligence and machine learning[M]. San Rafael: Morgan & Claypool Publishers, 2009: 2-3.
  • 4 Chatterjee K, Majumdar R, Henzinger T A. Stochastic limit-average games are in EXPTIME[J]. International Journal of Game Theory, 2007, (2): 219-234.
  • 5 Tadepalli P, Ok D. Model-based average reward reinforcement learning[J]. Artificial Intelligence, 1998, (1-2): 177-224.
  • 6 Sun T, Zhao Q, Luh P B. A rollout algorithm for multichain Markov decision processes with average cost[J]. Positive Systems, 2009: 151-162.
  • 7 Li Yan-jie. An average reward performance potential estimation with geometric variance reduction[A]. 2012: 2061-2065.
  • 8 Cao X R. Stochastic learning and optimization: A sensitivity-based approach[J]. Annual Reviews in Control, 2009, (1): 11-24.
  • 9 Munos R. Geometric variance reduction in Markov chains: Application to value function and gradient estimation[J]. Journal of Machine Learning Research, 2006: 413-427.
  • 10 ZUO Guo-yu, ZHANG Hong-wei, HAN Guang-sheng. Design of a new reinforcement function based on multi-agent reinforcement learning[J]. Control Engineering of China, 2009, 16(2): 239-242. (Cited 4 times)
