
一种基于启发式奖赏函数的分层强化学习方法 (Cited by: 11)

A Hierarchical Reinforcement Learning Method Based on Heuristic Reward Function
Abstract: Reinforcement learning controls an autonomous agent in an unknown environment, usually described as a state space. The agent has no prior knowledge of the environment and can acquire knowledge only by acting in it. Reinforcement learning, and Q-learning in particular, faces a major obstacle: learning the Q-function in tabular form may be infeasible, because the memory needed to store the table is excessive and the Q-function converges only after each state has been visited many times. Large state spaces therefore inevitably produce the "curse of dimensionality", in which the size of the state space grows exponentially with the number of features and convergence slows down. To address this problem, a hierarchical reinforcement learning method based on a heuristic reward function is proposed. The method greatly reduces the environment's state space and speeds up learning: actions are chosen purposefully and efficiently so as to optimize the reward function and accelerate convergence. The method is applied to a Tetris simulation platform. Analysis of the algorithm and the experimental results, with the parameters set as described, show that the hierarchical reinforcement learning method with a heuristic reward function can alleviate the "curse of dimensionality" to a certain extent and converges quickly.
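The abstract describes two ingredients: a tabular value function over a reduced, feature-based state space, and a heuristic reward that steers action selection. Below is a minimal sketch, in Python, of that general idea only — tabular Q-learning whose reward is augmented by a hand-crafted heuristic over Tetris-style features. The feature tuple (holes, height, lines), the weights inside heuristic(), and the function names are illustrative assumptions, not the paper's actual design or subtask hierarchy.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount factor, exploration rate

def heuristic(features):
    # Hypothetical heuristic reward: reward cleared lines, penalize holes and stack height.
    holes, height, lines = features
    return 1.0 * lines - 0.5 * holes - 0.2 * height

# Sparse Q-table over (feature-state, action) pairs.
Q = defaultdict(float)

def choose_action(state, actions):
    # Epsilon-greedy selection over the tabular Q-function.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, env_reward, next_state, next_features, actions):
    # One Q-learning backup; the environment reward is augmented by the heuristic term.
    shaped = env_reward + heuristic(next_features)
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (shaped + GAMMA * best_next - Q[(state, action)])

# Example backup: the board after a move is summarized by the feature tuple (holes, height, lines).
q_update(state=(2, 5, 0), action="rotate_left", env_reward=0.0,
         next_state=(2, 6, 0), next_features=(2, 6, 0),
         actions=["rotate_left", "rotate_right", "drop"])

In the paper's hierarchical setting the same shaping idea is applied within subtasks; a single flat learner is shown here only to make the reward-shaping step concrete.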
Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, Peking University Core), 2011, No. 12, pp. 2352-2358 (7 pages).
Funding: National Natural Science Foundation of China (60873116, 61070223, 61070122); Natural Science Foundation of Jiangsu Province (BK2008161, BK2009116); Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (09KJA520002); Jiangsu Province Engineering Research Center of Support Software for Modern Enterprise Information Application (SX200804).
Keywords: hierarchical reinforcement learning; trial-and-error; heuristic reward function; Tetris; curse of dimensionality
