摘要
围绕两人零和博弈所开展的一系列研究,近年来在围棋、德州扑克等问题中取得了里程碑式的突破.现有的两人零和博弈求解方案大多在理性对手的假设下围绕纳什均衡解开展,是一种力求不败的保守型策略,但在实际博弈中由于对手非理性等原因并不能保证收益最大化.对手建模为最大化博弈收益提供了一种新途径,但仍存在建模困难等问题.结合元学习的思想提出了一种能够快速适应对手策略的元策略演化学习求解框架.在训练阶段,首先通过种群演化的方法不断生成风格多样化的博弈对手作为训练数据,然后利用元策略更新方法来调整元模型的网络权重,使其获得快速适应的能力.在Leduc扑克、两人有限注德州扑克(Heads-up limit Texas Hold’em, LHE)和RoboSumo上的大量实验结果表明,该算法能够有效克服现有方法的弊端,实现针对未知风格对手的快速适应,从而为两人零和博弈收益最大化求解提供了一种新思路.
Recently, two-player zero-sum games have made impressive breakthroughs in the Go and Texas Hold’em. Most of the existing two-player zero-sum game solutions are based on the assumption of rational opponents to approximate the Nash equilibrium solutions, which is a conservative strategy of trying to be undefeated but does not guarantee maximum payoffs in practice due to the opponents’ irrationality. The opponent modeling provides a new way to maximize the payoff, but modeling has difficulties. This paper proposes a meta-evolutionary learning framework that can quickly adapt to the opponents. In the training phase, we first generate opponents with different styles as training data through the population evolution method, and then use the meta-strategy update method to adjust the network weights of the meta-model so that it can gain the ability to adapt quickly. Extensive experiments on Leduc poker, heads-up limit Texas Hold’em(LHE), and RoboSumo have shown that the algorithm can effectively overcome the drawbacks of existing methods and achieve fast adaptation to unknown style of opponents, thus providing a new way of solving two-player zero-sum games with maximum payoff.
作者
吴哲
李凯
徐航
兴军亮
WU Zhe;LI Kai;XU Hang;XING Jun-Liang(Center for Research on Intelligent System and Engineering,Institute of Automation,Chinese Academy of Sciences,Beijing 100190;School of Artificial Intelligence,University of Chinese Academy of Sciences,Beijing 100049;Department of Computer Science and Technology,Tsinghua University,Beijing 100084)
出处
《自动化学报》
EI
CAS
CSCD
北大核心
2022年第10期2462-2473,共12页
Acta Automatica Sinica
基金
国家重点研发计划(2020AAA0103401)
国家自然科学基金(62076238,61902402)
中国科学院战略性先导研究项目(XDA27000000)
CCF-腾讯犀牛鸟基金(RAGR20200104)资助。
关键词
两人零和博弈
纳什均衡
对手建模
元学习
种群演化
Two-player zero-sum games
Nash equilibrium
opponent modeling
meta learning
population evolution