Abstract: To address the problems of inaccurate next-state prediction by the environment dynamics model and insufficient samples for policy learning in the mean-field multi-agent reinforcement learning (M3-UCRL) algorithm, this paper leverages the data generation capability of denoising diffusion probabilistic models (DDPM) and proposes a DDPM-based mean-field multi-agent reinforcement learning algorithm (DDPM-M3RL). The algorithm formulates the construction of the environment model as a denoising problem; by applying DDPM, it improves the accuracy of the environment model's next-state predictions and supplies ample sample data for subsequent policy learning, which speeds up the convergence of the policy model. Experimental results show that the algorithm effectively improves the accuracy of the environment dynamics model's next-state predictions, that the state-transition data generated by the environment dynamics model provide sufficient learning samples for policy learning, and that it effectively improves the performance and stability of the navigation policy.
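The abstract does not give implementation details; the following is a minimal, illustrative sketch of how a conditional DDPM could play the role of the environment dynamics model it describes, learning to denoise the next state conditioned on the current state, action, and mean-field information. The dimensions, network architecture, and noise schedule below are assumptions for illustration, not the authors' implementation.

# Minimal sketch (not the authors' code): a conditional DDPM that learns to
# predict the next state s' given (s, a, mu), where mu stands in for the
# mean-field (population) information. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

S_DIM, A_DIM, MU_DIM, T = 4, 2, 8, 100           # assumed dimensions / diffusion steps

betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative products of (1 - beta_t)

class EpsNet(nn.Module):
    """Predicts the noise added to the next state, conditioned on (s, a, mu, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(S_DIM + S_DIM + A_DIM + MU_DIM + 1, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, S_DIM))
    def forward(self, x_t, s, a, mu, t):
        t_emb = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([x_t, s, a, mu, t_emb], dim=-1))

def ddpm_loss(model, s, a, mu, s_next):
    """Standard DDPM training objective: predict the injected Gaussian noise."""
    b = s.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(s_next)
    ab = alphas_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * s_next + (1 - ab).sqrt() * eps      # forward (noising) process
    return ((model(x_t, s, a, mu, t) - eps) ** 2).mean()

@torch.no_grad()
def sample_next_state(model, s, a, mu):
    """Reverse (denoising) process: generate a predicted next state."""
    x = torch.randn(s.shape[0], S_DIM)
    for t in reversed(range(T)):
        tb = torch.full((s.shape[0],), t)
        eps_hat = model(x, s, a, mu, tb)
        alpha, ab = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1 - alpha) / (1 - ab).sqrt() * eps_hat) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

Once such a model is trained with ddpm_loss on observed transitions, sample_next_state can be called repeatedly to generate synthetic state-transition data, which is the role the abstract attributes to DDPM in supplying samples for policy learning.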
Abstract: To simplify the complex computation that a Markov perfect equilibrium approach may entail, this paper draws on game theory and proposes a mean-field-equilibrium routing protocol for wireless ad hoc networks (mean field equilibrium AODV, MFEA). In this approach, each node analyzes its own optimal strategy using aggregate information about all other nodes, without needing to know each individual player's information, and when the number of players is sufficiently large its performance more closely approximates the Markov equilibrium. Simulation results show that the proposed MFEA routing protocol outperforms the AODV (Ad hoc On-demand Distance Vector) protocol in packet delivery ratio, delay, and normalized overhead, and still achieves good results in densely deployed wireless ad hoc networks.
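As an illustration of the mean-field idea in the abstract above (each node responds to population-level statistics rather than to every individual player's strategy), the following minimal sketch runs a damped best-response iteration toward a mean-field fixed point. The cost model and all parameters are assumptions for illustration only and are not taken from the MFEA protocol.

# Minimal sketch (illustrative, not the MFEA protocol): mean-field
# best-response iteration. Each node reacts only to the population-average
# forwarding intensity m, not to every individual node's strategy.
import numpy as np

def node_cost(p, m):
    """Cost of forwarding with intensity p when the population average is m:
    a penalty for not forwarding plus a congestion/energy term that grows
    with both the node's own effort and the crowd's effort (assumed model)."""
    return (1.0 - p) + 0.5 * p ** 2 * (1.0 + 2.0 * m)

def best_response(m, grid=np.linspace(0.0, 1.0, 1001)):
    """Best response to the mean field: the intensity minimizing the node's cost."""
    costs = np.array([node_cost(p, m) for p in grid])
    return grid[costs.argmin()]

def mean_field_equilibrium(n_iter=100, damping=0.5):
    """Damped fixed-point iteration: at equilibrium the population average of
    best responses reproduces the mean field the nodes responded to."""
    m = 0.0
    for _ in range(n_iter):
        m = (1.0 - damping) * m + damping * best_response(m)
    return m

if __name__ == "__main__":
    print("approximate mean-field equilibrium intensity:", round(mean_field_equilibrium(), 3))

The point of the sketch is the information structure: the best response depends only on the scalar mean field m, so no node needs the strategies of each individual player, which is what removes the combinatorial burden of computing a Markov perfect equilibrium directly.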