An Action-sampling Based Q-learning Algorithm (一种基于动作采样的Q学习算法)

Cited by: 1

Abstract: Reinforcement learning uses the formal framework of the Markov decision process, defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. In multi-agent reinforcement learning, the number of joint actions grows exponentially with the number of agents. To alleviate this problem, an action-sampling based Q-learning (ASQ) algorithm is proposed. The algorithm adopts the centralized-training, decentralized-execution framework. When updating joint-action Q-values in the centralized training phase, it does not traverse all joint-action Q-values but samples only a subset of them. In the action selection and execution phase, each agent selects its action independently, which effectively reduces the computation required in the learning phase. Experimental results show that the algorithm learns the optimal joint policy with a 100% success rate.
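The abstract does not state the update rule itself, so the following Python sketch only illustrates the general idea of sampling joint actions instead of enumerating them. It is not the authors' ASQ implementation. It assumes a tabular joint Q-table stored in a dictionary, uniform random sampling of joint actions, and a standard Q-learning target; the names `sampled_max_q`, `full_max_q`, and `sampled_q_update`, and the values of `alpha`, `gamma`, and `n_samples`, are hypothetical. The decentralized action-selection step is not shown because the abstract gives no detail on it.

```python
import itertools
import random


def sampled_max_q(q, state, n_agents, n_actions, n_samples, rng):
    """Approximate the max over joint actions by scoring only n_samples
    uniformly sampled joint actions (assumed sampling scheme)."""
    best = float("-inf")
    for _ in range(n_samples):
        joint = tuple(rng.randrange(n_actions) for _ in range(n_agents))
        best = max(best, q.get((state, joint), 0.0))  # unseen entries count as 0
    return best


def full_max_q(q, state, n_agents, n_actions):
    """Exact max for comparison: visits all n_actions ** n_agents joint actions."""
    return max(
        q.get((state, joint), 0.0)
        for joint in itertools.product(range(n_actions), repeat=n_agents)
    )


def sampled_q_update(q, state, joint, reward, next_state, *, n_agents,
                     n_actions, n_samples, rng, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step whose bootstrap term uses the sampled max."""
    old = q.get((state, joint), 0.0)
    target = reward + gamma * sampled_max_q(
        q, next_state, n_agents, n_actions, n_samples, rng
    )
    q[(state, joint)] = old + alpha * (target - old)
    return q[(state, joint)]


if __name__ == "__main__":
    rng = random.Random(0)
    q_table = {}
    # 3 agents with 4 actions each: 64 joint actions, but only 8 are scored.
    value = sampled_q_update(
        q_table, "s0", (0, 0, 0), 1.0, "s1",
        n_agents=3, n_actions=4, n_samples=8, rng=rng,
    )
    print(value)
```

With `n_samples` fixed, the cost of each update no longer depends on the `n_actions ** n_agents` size of the joint action space, at the price of a bootstrap value that can underestimate the true maximum.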

Authors: ZHAO Dejing (赵德京); MA Hongcong (马洪聪); LIAO Dengyu (廖登宇); CUI Haoyan (崔浩岩) (School of Automation, Qingdao University, Qingdao 266071, China; Qingdao Petrochemical Maintenance and Installation Engineering Co., Ltd., Qingdao 266043, China)

Affiliations: School of Automation, Qingdao University; Qingdao Petrochemical Maintenance and Installation Engineering Co., Ltd.

Source: Control Engineering of China (《控制工程》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 1, pp. 70-79 (10 pages)

Funding: Qingdao Postdoctoral Applied Research Project.

Keywords: multi-agent reinforcement learning; reinforcement learning; Q-learning; action-sampling

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
