Abstract
Efficient multi-agent cooperation is an important goal of Multi-Agent Deep Reinforcement Learning (MADRL); however, environmental non-stationarity and the curse of dimensionality in multi-agent decision-making systems make this goal difficult to achieve. Existing value-decomposition methods strike a good balance between environment stationarity and agent scalability, but they neglect the importance of the agent policy network and do not fully exploit the complete historical trajectories stored in the replay buffer when learning the joint action-value function. This paper proposes a multi-agent cooperation method based on a Multi-agent Multi-step Dueling Network (MMDN). During training, an agent network and a value network decouple the evaluation of agent actions from the evaluation of environment states, and the temporal-difference target is estimated by multi-step learning over the entire historical trajectory. Decentralized multi-agent cooperation policies are then trained in a centralized, end-to-end manner by optimizing a mixing network that approximates the joint action-value function. Experimental results show that, in six scenarios, the average win rate of this method exceeds those of multi-agent cooperation methods based on the Value-Decomposition Network (VDN), QMIX, QTRAN, and the Counterfactual Multi-Agent (COMA) policy gradient, while also converging faster and with better stability.
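The dueling decoupling and the multi-step temporal-difference target named in the abstract correspond to standard constructions, written out below as a minimal sketch; the per-agent action-observation history \tau_i, the step count n, and the target-network parameters \theta^- are notational assumptions, since the abstract does not fix them.

Q_i(\tau_i, a_i) = V_i(\tau_i) + A_i(\tau_i, a_i) - \frac{1}{|\mathcal{A}|} \sum_{a'} A_i(\tau_i, a')

y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}_{t+n}, \mathbf{a}; \theta^{-})

The first equation decouples state evaluation V from action evaluation A inside each agent network; the second bootstraps the target from the joint action-value function after n rewards rather than after one.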
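As a concrete illustration, the sketch below shows one plausible PyTorch realization of the components the abstract names. All names (DuelingAgentNet, n_step_td_target, mix_joint_q) are illustrative assumptions rather than the authors' code, and the mixer shown is the simple VDN-style sum, whereas MMDN's mixing network is only described as approximating the joint action-value function.

import torch
import torch.nn as nn

class DuelingAgentNet(nn.Module):
    # Per-agent dueling head: a shared trunk feeds a state-value stream V(tau)
    # and an action-advantage stream A(tau, a), decoupling state evaluation
    # from action evaluation as the abstract describes.
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)
        self.adv_head = nn.Linear(hidden, n_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        v = self.value_head(h)    # V(tau): state evaluation
        a = self.adv_head(h)      # A(tau, a): action evaluation
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

def n_step_td_target(rewards: torch.Tensor, bootstrap_q: torch.Tensor,
                     gamma: float, n: int) -> torch.Tensor:
    # rewards: shape (n, batch), holding r_t ... r_{t+n-1};
    # bootstrap_q: shape (batch,), joint action value at step t+n.
    target = torch.zeros_like(bootstrap_q)
    for k in range(n):
        target = target + (gamma ** k) * rewards[k]
    return target + (gamma ** n) * bootstrap_q

def mix_joint_q(per_agent_q: torch.Tensor) -> torch.Tensor:
    # per_agent_q: (batch, n_agents) chosen-action values -> (batch,) joint Q.
    # Additive VDN-style mixing, the simplest monotonic mixer.
    return per_agent_q.sum(dim=-1)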
Authors
LI Zifan; WANG Hao; FANG Baofu (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China)
Source
Computer Engineering (《计算机工程》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2022, No. 5, pp. 74-81 (8 pages)
Funding
National Natural Science Foundation of China (61876206)
Fundamental Research Funds for the Central Universities (ACAIM190102)
Natural Science Foundation of Anhui Province (1708085MF146)
Open Fund of the Key Laboratory of Flight Techniques and Flight Safety, CAAC (FZ2020KF15)
Keywords
multi-agent cooperation
Deep Reinforcement Learning (DRL)
value decomposition
multi-step dueling network
action-value function