Abstract
In multi-agent reinforcement learning, the training and testing environments often differ, so how to make agents cope with changes in the policies of the other agents in the environment has drawn wide attention from researchers. To address this generalization problem, a human-preference-based multi-agent role policy ensemble algorithm is proposed, which considers both long-term and immediate rewards. Each agent selects, from a set of candidate actions with good long-term cumulative returns, the action with the largest immediate reward. This fixes the direction of policy updates, avoids excessive exploration and ineffective training, and allows the optimal policy to be found quickly. In addition, agents are dynamically grouped into roles according to the immediate rewards of their historical actions, and agents with the same role share parameters, which improves efficiency and makes the multi-agent algorithm scalable. Comparisons with existing algorithms in the multi-agent particle environment show that the proposed agents generalize better to unknown environments, converge faster, and train the optimal policy more efficiently.
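The selection rule described in the abstract (first restrict to actions with good long-term returns, then pick the one with the largest immediate reward) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `q_long`, `r_immediate`, and the candidate-set size `k` are assumptions, and how the two value estimates are learned is not specified here.

```python
def select_action(q_long, r_immediate, k=3):
    """Pick an action by long-term value first, immediate reward second.

    q_long:      per-action long-term return estimates (e.g. Q-values)
    r_immediate: per-action immediate-reward estimates
    k:           number of long-term candidates to keep (assumed hyperparameter)
    """
    # keep the k actions with the best long-term value estimates
    candidates = sorted(range(len(q_long)), key=lambda a: q_long[a])[-k:]
    # among those candidates, choose the one with the highest immediate reward
    return max(candidates, key=lambda a: r_immediate[a])
```

For example, with `q_long = [0.1, 0.9, 0.8, 0.7]` and `r_immediate = [1.0, 0.2, 0.5, 0.9]`, action 0 is excluded despite its high immediate reward because its long-term value is poor; among the remaining candidates {1, 2, 3}, action 3 is chosen.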
Authors
GUO Xin; WANG Wei; QING Wei; LI Jian; HE Zhao-feng (Beijing University of Posts and Telecommunications, Beijing 100088, China)
Source
Computer Technology and Development, 2023, No. 4, pp. 114-119 (6 pages)
Funding
National Natural Science Foundation of China (62176025, 62076232)
Fundamental Research Funds for the Central Universities (2021RC38, 2021RC39)
Keywords
deep reinforcement learning
multi-agent
unknown environment
policy ensemble
generalization
scalability