
一种自适应的多臂赌博机算法 (Cited by: 9)

An Adaptive Algorithm in Multi-Armed Bandit Problem
Abstract: As an important and active field of machine learning, reinforcement learning has received extensive attention in recent years. The multi-armed bandit (MAB) problem is the classic setting for studying the balance between exploration and exploitation in reinforcement learning, and the stochastic multi-armed bandit (SMAB) problem is its most classical variant, serving as the basis for many newer MAB problems. To address two weaknesses of existing MAB methods, insufficient use of environmental feedback and poor generalization ability, this paper presents an adaptive SMAB algorithm that balances exploration and exploitation using the chosen number of the arm with the minimal estimated value (CNAME for short). CNAME uses an action's chosen count and its estimated value at the same time, so that actions are selected according to an exploration probability that is updated adaptively, which effectively mitigates the imbalance between exploration and exploitation. To control the decline rate of the exploration probability, a parameter w is introduced to adjust how strongly feedback influences the selection process. Furthermore, CNAME does not depend on contextual information and therefore generalizes better across different MAB scenarios. An upper bound on CNAME's regret is proved and analyzed theoretically, and experimental results in different scenarios show that CNAME efficiently achieves higher reward and lower regret than commonly used methods, with stronger generalization ability.
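The abstract states that an upper bound on CNAME's regret is proved, but the bound itself is not reproduced in this record. For reference, the standard (pseudo-)regret that SMAB bounds of this kind are stated over, assuming arm i has mean reward \mu_i and a_t denotes the arm pulled at round t over a horizon T, is

$$R(T) = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right], \qquad \mu^* = \max_i \mu_i.$$

The specific bound proved for CNAME is in the full text, not in this record.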
Authors: Zhang Xiaofang (章晓芳), Zhou Qian (周倩), Liang Bin (梁斌), Xu Jin (徐进). Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023.
Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, and the Peking University Core list), 2019, No. 3, pp. 643-654 (12 pages).
Funding: National Natural Science Foundation of China (61772263, 61772014, 61572375); Suzhou Science and Technology Development Program (SYG201807).
Keywords: reinforcement learning; multi-armed bandit; exploration and exploitation; adaptation; contextual
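The abstract describes CNAME only qualitatively: the exploration probability is adapted from the chosen count of the arm whose current estimated value is minimal, and a parameter w controls how fast that probability declines. Below is a minimal Python sketch of that idea, not the paper's algorithm: the concrete rule eps = w / (w + c_min), the class name CNAMESketch, and the toy Bernoulli test bed are all assumptions made for illustration only.

import random

class CNAMESketch:
    """Illustrative CNAME-style adaptive bandit.

    Assumption: the abstract only says the exploration probability is
    updated adaptively from the chosen count of the arm with the minimal
    estimated value, with w controlling the decline rate. The rule
    eps = w / (w + c_min) below is one simple form consistent with that
    description, not the rule from the paper.
    """

    def __init__(self, n_arms, w=1.0):
        self.w = w
        self.counts = [0] * n_arms        # pulls per arm
        self.estimates = [0.0] * n_arms   # incremental mean reward per arm

    def exploration_prob(self):
        # Chosen count of the arm whose current estimate is minimal.
        arms = range(len(self.estimates))
        min_arm = min(arms, key=lambda a: self.estimates[a])
        c_min = self.counts[min_arm]
        return self.w / (self.w + c_min)  # decays as c_min grows; w sets the rate

    def select_arm(self):
        if random.random() < self.exploration_prob():
            return random.randrange(len(self.counts))          # explore uniformly
        arms = range(len(self.estimates))
        return max(arms, key=lambda a: self.estimates[a])      # exploit greedily

    def update(self, arm, reward):
        # Standard incremental update of the sample-mean reward estimate.
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]

# Toy run on a 3-armed Bernoulli bandit (the means are made up).
means = [0.2, 0.5, 0.8]
agent = CNAMESketch(n_arms=3, w=2.0)
regret = 0.0
for _ in range(10000):
    arm = agent.select_arm()
    reward = 1.0 if random.random() < means[arm] else 0.0
    agent.update(arm, reward)
    regret += max(means) - means[arm]

Under this sketch, a larger w keeps exploration alive longer, while w near zero makes the policy close to pure greedy after the first few pulls, which matches the qualitative role the abstract attributes to w.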

References: 6

Secondary references: 76


Co-cited documents: 113

Co-citing documents: 68

Citing documents: 9

Secondary citing documents: 9
