
基于概率模型的动态分层强化学习 (Cited by: 2)

Dynamic hierarchical reinforcement learning based on probability model
Abstract: To deal with the "curse of dimensionality" in large-scale reinforcement learning and the strong dependence of existing learning algorithms on prior knowledge, this paper proposes a dynamic hierarchical reinforcement-learning method based on a probability model (the DHRL-model). The method builds a state-transition probability model by Bayesian learning and identifies key states automatically from its probability parameters; it then generates state subspaces dynamically by clustering and learns the optimal policy under the resulting hierarchical structure. Simulation results show that the DHRL-model algorithm remarkably improves the agent's learning efficiency in complex environments and can be applied to learning in unknown, large-scale environments.
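The abstract outlines three model-based steps: Bayesian learning of a state-transition probability model, identification of key states from its probability parameters, and clustering into state subspaces. As a rough illustration of the first two steps only, the following is a minimal Python sketch assuming a tabular MDP with Dirichlet priors and an inflow-based key-state score; it is not the authors' implementation, and the names used (DirichletTransitionModel, key_state_scores) are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's code. Assumes a small tabular MDP,
# a Dirichlet prior over each transition row, and an "inflow"-based key-state score.
import numpy as np


class DirichletTransitionModel:
    """Bayesian tabular model of P(s' | s, a) with a Dirichlet(alpha) prior per (s, a)."""

    def __init__(self, n_states: int, n_actions: int, alpha: float = 1.0):
        # counts[s, a, s'] start at the prior pseudo-count alpha
        self.counts = np.full((n_states, n_actions, n_states), alpha, dtype=float)

    def update(self, s: int, a: int, s_next: int) -> None:
        """Record one observed transition (s, a) -> s'; the posterior stays Dirichlet."""
        self.counts[s, a, s_next] += 1.0

    def posterior_mean(self) -> np.ndarray:
        """Posterior-mean transition probabilities, shape (S, A, S)."""
        return self.counts / self.counts.sum(axis=2, keepdims=True)


def key_state_scores(model: DirichletTransitionModel) -> np.ndarray:
    """A plausible probability-based key-state criterion (an assumption, not the paper's
    exact rule): a state scores high when, on average, many (s, a) pairs are likely to
    transition into it, i.e. it behaves like a bottleneck under the learned model."""
    p = model.posterior_mean()        # (S, A, S)
    inflow = p.mean(axis=(0, 1))      # mean probability of entering each state s'
    return inflow / inflow.sum()


if __name__ == "__main__":
    # Toy 6-state, 2-action chain: action 0 usually moves right, action 1 jumps to state 3,
    # so state 3 should emerge as a candidate key state.
    rng = np.random.default_rng(0)
    model = DirichletTransitionModel(n_states=6, n_actions=2, alpha=0.5)
    s = 0
    for _ in range(2000):
        a = int(rng.integers(2))
        s_next = min(s + 1, 5) if (a == 0 and rng.random() < 0.8) else 3
        model.update(s, a, s_next)
        s = s_next
    print("candidate key states (highest score first):", np.argsort(key_state_scores(model))[::-1])
```

Starting every row from a Dirichlet pseudo-count keeps the estimated transition probabilities strictly positive before any data arrive, which suits the unknown-environment setting the abstract targets; clustering these estimates into subspaces and learning a hierarchical policy on top would complete the pipeline the abstract describes.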
Source: Control Theory & Applications (《控制理论与应用》), 2011, No. 11: 1595-1600, 1606 (7 pages in total). Indexed in EI, CAS, CSCD, and the Peking University Core Journal list.
Funding: National Natural Science Foundation of China (60874042); China Postdoctoral Science Foundation, first-class grant (20080440177); China Postdoctoral Science Foundation, special grant (200902483); Research Fund for the Doctoral Program of Higher Education of China, New Teacher Fund (20090162120068).
Keywords: dynamic hierarchical reinforcement learning; Bayesian learning; state-transition probability model; agent

References (16)

1. 高阳, 陈世福, 陆鑫. A survey of research on reinforcement learning [J]. Acta Automatica Sinica (自动化学报), 2004, 30(1): 86-100. (Cited by: 268)
2. KAELBLING L P, LITTMAN M L. Reinforcement learning: a survey [J]. Journal of Artificial Intelligence Research, 1996, 4(1): 237-285.
3. STRENS M. A Bayesian framework for reinforcement learning [C] // Proceedings of the 17th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000: 943-950.
4. 沈晶, 顾国昌, 刘海波. Research on dynamic hierarchy methods in hierarchical reinforcement learning [J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2007, 28(2): 287-291. (Cited by: 1)
5. 彭志平, 李绍平. Research progress in hierarchical reinforcement learning [J]. Application Research of Computers (计算机应用研究), 2008, 25(4): 974-978. (Cited by: 7)
6. SUTTON R S, PRECUP D, SINGH S. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning [J]. Artificial Intelligence, 1999, 112(1): 181-211.
7. PARR R E. Hierarchical control and learning for Markov decision processes [D]. Berkeley, CA: University of California, 1998.
8. DIETTERICH T G. Hierarchical reinforcement learning with the MAXQ value function decomposition [J]. Journal of Artificial Intelligence Research, 2000, 13(1): 227-303.
9. HENGST B. Discovering hierarchy in reinforcement learning with HEXQ [C] // Proceedings of the 19th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002: 243-250.
10. JONG N K, STONE P. Hierarchical model-based reinforcement learning: R-MAX + MAXQ [C] // Proceedings of the 25th International Conference on Machine Learning. New York: ACM, 2008: 432-439.

Secondary references (60)

1. LI Wei, YE Qingtai, ZHU Changming. Application of hierarchical reinforcement learning in engineering domain [J]. Journal of Systems Science and Systems Engineering, 2005, 14(2): 207-217. (Cited by: 3)
2. BARTO A G, MAHADEVAN S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems: Theory and Applications, 2003, 13(4): 41-77.
3. SUTTON R S, PRECUP D, SINGH S P. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning [J]. Artificial Intelligence, 1999, 112(1): 181-211.
4. PARR R. Hierarchical control and learning for Markov decision processes [D]. Berkeley: University of California, 1998.
5. DIETTERICH T G. Hierarchical reinforcement learning with the MAXQ value function decomposition [J]. Journal of Artificial Intelligence Research, 2000, 13(1): 227-303.
6. DIGNEY B L. Learning hierarchical control structures for multiple tasks and changing environments [C] // Proceedings of the 5th International Conference on Simulation of Adaptive Behavior. Zurich, Switzerland, 1998: 321-330.
7. MCGOVERN A, BARTO A. Autonomous discovery of subgoals in reinforcement learning using diverse density [C] // Proceedings of the 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 2001: 361-368.
8. MENACHE I, MANNOR S, SHIMKIN N. Q-cut: dynamic discovery of sub-goals in reinforcement learning [C] // Proceedings of the 13th European Conference on Machine Learning. Helsinki, Finland, 2002: 295-306.
9. MANNOR S, et al. Dynamic abstraction in reinforcement learning via clustering [C] // Proceedings of the 21st International Conference on Machine Learning. Banff, Canada, 2004: 560-567.
10. PRECUP D. Temporal abstraction in reinforcement learning [D]. Amherst: University of Massachusetts, 2000.

Co-citing literature (273)

Co-cited literature (31)

1. SUTTON R S, BARTO A G. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998.
2. LIU C, XU X, HU D. Multiobjective reinforcement learning: a comprehensive overview [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2013, 99(4): 1-13.
3. WIERING M, VAN OTTERLO M. Reinforcement Learning: State of the Art [M]. Berlin: Springer-Verlag, 2012, 10(3): 325-331.
4. BUSONIU L, BABUSKA R, DE SCHUTTER B. Reinforcement Learning and Dynamic Programming Using Function Approximators [M]. New York: CRC Press, 2010.
5. VAN DEN DRIES S, WIERING M A. Neural-fitted TD-leaf learning for playing Othello with structured neural networks [J]. IEEE Transactions on Neural Networks and Learning Systems, 2012, 23(11): 1701-1713.
6. SUTTON R S. Learning to predict by the methods of temporal differences [J]. Machine Learning, 1988, 3(1): 9-44.
7. DAYAN P, SEJNOWSKI T J. TD(λ) converges with probability 1 [J]. Machine Learning, 1994, 14(1): 295-301.
8. MIROLLI M, SANTUCCI V G, BALDASSARRE G. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study [J]. Neural Networks, 2013, 39(3): 40-51.
9. BHASIN S, KAMALAPURKAR R, JOHNSON M, et al. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems [J]. Automatica, 2013, 49(1): 82-92.
10. BAIRD L C. Residual algorithms: reinforcement learning with function approximation [C] // Proceedings of the 12th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995: 30-37.

Citing literature (2)

Secondary citing literature (6)
