Abstract
A generalized fixed-point solution model was proposed to address the question of which reinforcement learning fixed-point solution was better. The design extended fixed-point solutions through n-step bootstrapping and constructed fixed-point solutions by linear interpolation. Applying the design to the mature CBMPI algorithm framework, the CBMPI(n, β) algorithm based on generalized fixed points was proposed. To address the question of how to express and approximate the optimal solution, parameter optimization of the generalized fixed-point solution based on Bayesian optimization and higher-quality solutions based on ensemble learning were proposed. The effectiveness of the proposed algorithms was verified in the classical 10×10 Tetris game environment. Experimental results showed that the generalized fixed-point construction based on linear interpolation outperformed the traditional n-step fixed point, and that its performance depended strongly on the hyperparameters, namely the step length n and the interpolation parameter β. Over 100 games of Tetris, an average score of 4 388.3 was achieved, indicating that Bayesian optimization could identify multiple sets of well-performing policies. Policy ensembling and value-function ensembling were then applied to the four best-performing sets of generalized fixed-point policy parameters (the results of Bayesian optimization), yielding higher-quality solutions with average scores of 4 526.29 and 4 579.74, respectively. These results showed that both the policy ensemble and the value-function ensemble based on generalized fixed points scored slightly higher than the individual generalized fixed-point policies, confirming that ensemble learning could be used to find higher-quality solutions.
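The abstract does not give the exact construction; as an illustrative sketch only, one possible reading of a generalized fixed point that combines n-step bootstrapping with linear interpolation is a value function $v_{n,\beta}$ satisfying

$$ v_{n,\beta} \;=\; (1-\beta)\,\mathcal{T}^{\,n} v_{n,\beta} \;+\; \beta\,\mathcal{T}^{\,n+1} v_{n,\beta}, \qquad \beta \in [0,1], $$

where $\mathcal{T}$ denotes the Bellman operator of the evaluated policy; for β = 0 this reduces to the ordinary n-step fixed point $v = \mathcal{T}^{\,n} v$. The operator $\mathcal{T}$, the notation $v_{n,\beta}$, and the specific choice of interpolating between the n-step and (n+1)-step operators are assumptions made here for illustration, not details stated in the abstract.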
Authors
CHEN Xingguo, LÜ Yongzhou, GONG Yu, CHEN Yaoxiong (Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, Jiangsu, China; Faculty of Electronic Information Engineering, Huaiyin Institute of Technology, Huaian 223003, Jiangsu, China)
Source
Journal of Shandong University (Engineering Science)
Indexed in: CAS, CSCD, Peking University Core Journals
2024, No. 4, pp. 21-34 (14 pages)
Funding
National Natural Science Foundation of China (62276142, 62206133, 62202240, 62192783)
Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0100905)
Jiangsu Province Industry Foresight and Key Core Technology Competition Project (BE2021028)
Shenzhen Central Government Guided Local Science and Technology Development Fund Project (2021Szvup056)
Keywords
reinforcement learning
value function approximation
fixed point
Bayesian optimization
Tetris