
Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation (cited 3 times)

Abstract: Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing works on Dyna mostly discuss only its efficiency in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, aiming to handle problems with continuous action spaces. Dyna-LSTD-PA stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process determines the probability distribution over action spaces using the Gaussian distribution; estimates the underlying value function, policy, and model by linear representation; and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by using offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman-Morrison formula to improve the efficiency of LSTD(λ), and weights the parameter vector of the value function to bring the two processes together. Theoretically, a global error bound is derived by considering approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in terms of convergence rate, success rate, and stability on four benchmark RL problems.
Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2019, Issue 1, pp. 106-126 (21 pages).


