Abstract
In real dynamic game scenarios, the two adversaries are characterized by unequal information, different working mechanisms, and different rules. Existing reinforcement learning algorithms, however, fit approximate models by assuming that the state is fully or partially observable. When the opponent's state information is difficult or impossible to obtain accurately, this assumption no longer holds, so existing reinforcement learning models cannot be applied directly. To address this problem, a new framework of asymmetric unobservable reinforcement learning is proposed, under which an agent can learn online from value feedback alone. To verify the feasibility and generality of the framework, three typical reinforcement learning algorithms are transplanted into it, and a game confrontation model is built for comparative verification. The results show that all three algorithms can be successfully applied to dynamic game environments with unobservable states and that their convergence speed is greatly improved, which demonstrates the feasibility and generality of the proposed framework.
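According to the abstract, the defining property of the proposed framework is that an agent learns online from scalar value feedback alone, without observing its own or the opponent's state. As a rough, hypothetical sketch only (the class, the env.step interface, and all parameters below are assumptions, not the paper's actual method), value-feedback-only online learning in the spirit of stateless Q-learning could look like this:

```python
import random

# Hypothetical sketch, not the paper's framework: an agent that
# never observes any state and updates per-action value estimates
# from scalar value feedback alone (stateless Q-learning / bandit).
class ValueFeedbackAgent:
    def __init__(self, n_actions, lr=0.1, epsilon=0.1):
        self.q = [0.0] * n_actions   # running value estimate per action
        self.lr = lr                 # learning rate
        self.epsilon = epsilon       # exploration probability

    def act(self):
        # Epsilon-greedy over action values; note there is no state input.
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=self.q.__getitem__)

    def update(self, action, value):
        # Online incremental update driven only by the value feedback.
        self.q[action] += self.lr * (value - self.q[action])

# Usage against an opaque adversarial environment `env` exposing only
# a step(action) -> value interface (also hypothetical):
#   agent = ValueFeedbackAgent(n_actions=4)
#   for _ in range(10_000):
#       a = agent.act()
#       agent.update(a, env.step(a))
```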
Authors
LI Xinzhi; DONG Shengbo; CUI Xiangyang (Beijing Institute of Remote Sensing Equipment, Beijing 100854, China; State Key Laboratory of Communication Content Cognition, Beijing 100733, China)
Source
Systems Engineering and Electronics
EI
CSCD
PKU Core
2023, No. 6, pp. 1755-1761 (7 pages)
Keywords
reinforcement learning
dynamic game
asymmetric unobservable state