Abstract: In cooperative multiagent systems, learning the agents' optimal policies is very difficult. Because the numbers of states and actions grow exponentially with the number of agents, the agents' action policies become increasingly intractable. By learning value functions, an agent can learn its optimal action policy for a task. If a task can be decomposed into several subtasks and the agents have learned the optimal value functions for each subtask, this knowledge can help the agents learn the optimal action policies for the whole task when they act simultaneously. A novel multiagent online reinforcement learning algorithm, LU-Q, is proposed for merging the agents' independently learned optimal value functions. By applying a transformation to the individually learned value functions, the constraints on the optimal value functions of each subtask are relaxed. In each learning iteration of LU-Q, the agents' joint action set in a state is processed: some actions are pruned from the state's available action set according to the multiagent value function defined in LU-Q. As the available action set of each state shrinks gradually over the iterations of LU-Q, convergence of the value functions is accelerated. The effectiveness, soundness and convergence of LU-Q are analyzed, and experimental results show that LU-Q outperforms standard Q-learning.
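To make the pruning idea concrete, the sketch below shows value-function-guided action pruning in tabular Q-learning, in the spirit of what the abstract describes. It is a minimal, illustrative interpretation only: the toy environment, the merge rule (summing per-subtask optimal values as a rough upper bound), the pruning margin, and all names are assumptions introduced for this example, not the paper's actual LU-Q algorithm.

```python
import numpy as np

# Illustrative sketch: Q-learning where each state's available joint-action
# set is gradually pruned using value functions learned for subtasks.
# Environment, merge rule, and pruning margin are all hypothetical.

rng = np.random.default_rng(0)

N_STATES, N_JOINT_ACTIONS = 10, 6
GAMMA, ALPHA, EPISODES = 0.9, 0.1, 500

# Pretend two subtasks were already solved independently; summing their
# optimal values gives a (hypothetical) merged estimate of joint-task value,
# used only to discard clearly suboptimal joint actions.
subtask_q1 = rng.random((N_STATES, N_JOINT_ACTIONS))
subtask_q2 = rng.random((N_STATES, N_JOINT_ACTIONS))
merged_bound = subtask_q1 + subtask_q2

q = np.zeros((N_STATES, N_JOINT_ACTIONS))
available = [list(range(N_JOINT_ACTIONS)) for _ in range(N_STATES)]

def step(state, action):
    """Toy transition/reward model standing in for the real environment."""
    next_state = (state + action) % N_STATES
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for ep in range(EPISODES):
    # Pruning tolerance shrinks over time, so the action sets are reduced
    # gradually rather than all at once.
    margin = 1.5 * (1.0 - ep / EPISODES)
    s = int(rng.integers(N_STATES))
    for _ in range(50):
        acts = available[s]
        # Epsilon-greedy selection over the (shrinking) available action set.
        if rng.random() < 0.1:
            a = acts[int(rng.integers(len(acts)))]
        else:
            a = acts[int(np.argmax(q[s, acts]))]
        s2, r = step(s, a)
        best_next = np.max(q[s2, available[s2]])
        q[s, a] += ALPHA * (r + GAMMA * best_next - q[s, a])

        # Prune actions whose merged-value bound falls far below the best one,
        # always keeping at least one action for the state.
        best_bound = np.max(merged_bound[s, acts])
        kept = [b for b in acts if merged_bound[s, b] >= best_bound - margin]
        if kept:
            available[s] = kept
        s = s2
```

Under these assumptions, the update rule is ordinary Q-learning; the only change is that both action selection and the max over next-state values range over a per-state action set that shrinks as low-valued joint actions are eliminated, which is the mechanism the abstract credits for faster convergence.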