

Counterfactual Regret Advantage-based Self-play Approach for Mixed Cooperative-competitive Multi-agent Systems
Abstract: A mixed cooperative-competitive multi-agent system consists of controlled target agents and uncontrolled external agents. The target agents cooperate with one another and compete with the external agents so as to cope with dynamic changes in the environment and in the external agents, and ultimately complete the given tasks. To train the target agents so that they learn the optimal policy for completing the tasks, existing work follows two lines: (1) focusing only on the cooperation among target agents, treating the external agents as part of the environment, and leveraging multi-agent reinforcement learning to train the target agents; these approaches cannot handle external agents whose policies are unknown or change dynamically; (2) focusing only on the competition between target agents and external agents, modeling the competition as a two-player game, and using self-play to train the target agents; these approaches mainly address the case of a single target agent and a single external agent and are difficult to extend to systems consisting of multiple target agents and multiple external agents. Combining the two lines of work, this study proposes a counterfactual regret advantage-based self-play approach. Specifically, first, based on counterfactual regret minimization and the counterfactual multi-agent policy gradient, a counterfactual regret advantage-based policy gradient is designed so that the target agents can update their policies more accurately. Second, to cope with dynamic changes in the external agents' policies during self-play, imitation learning is introduced: the external agents' historical decision-making trajectories serve as demonstration data for imitating the external agents' policies, thereby explicitly modeling the external agents' behaviors. Third, based on the counterfactual regret advantage-based policy gradient and the modeling of external agents' behaviors, a self-play training approach is designed that can obtain an optimal joint policy for multiple target agents even when the external agents' policies are unknown or change dynamically. Taking cooperative electromagnetic countermeasures as a case study, three typical tasks with mixed cooperative-competitive characteristics are designed. The experimental results demonstrate that, compared with other approaches, the proposed approach improves self-play performance by at least 78%.
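The abstract names two building blocks, counterfactual regret minimization and a counterfactual (COMA-style) advantage, without giving formulas. The sketch below is a generic illustration of those two standard ideas only, not the paper's implementation; both function names are hypothetical.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Derive a strategy from cumulative counterfactual regrets (core of CFR).

    Positive regrets are normalized into action probabilities; if no
    action has positive regret, the strategy falls back to uniform.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))

def counterfactual_advantage(q_values, policy):
    """COMA-style counterfactual advantage: each action's Q-value minus
    the policy's expected Q-value, holding other agents' actions fixed."""
    baseline = np.dot(policy, q_values)
    return q_values - baseline

# Toy usage: two actions with accumulated regrets favouring action 0.
strategy = regret_matching(np.array([3.0, 1.0]))
print(strategy)  # [0.75 0.25]

adv = counterfactual_advantage(np.array([1.0, 0.0]), strategy)
print(adv)  # [ 0.25 -0.75]
```

In the paper's setting, an advantage of this counterfactual form (rather than a plain state-value baseline) is what lets each target agent attribute credit for the team outcome to its own action when updating its policy.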
Authors: ZHANG Ming-Yue, JIN Zhi, LIU Kun (College of Computer and Information Science & School of Software, Southwest University, Chongqing 400715, China; School of Computer Science, Peking University, Beijing 100871, China; Key Lab of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China)
Source: Journal of Software (《软件学报》; indexed in EI, CSCD, Peking University Core), 2024, No. 2, pp. 739-757 (19 pages)
Funding: National Natural Science Foundation of China (62192731)
Keywords: multi-agent reinforcement learning; counterfactual regret minimization; self-play; dynamic decision-making
