Abstract
The size of the state space is the limiting factor in applying reinforcement learning algorithms to practical problems. This article establishes a reinforcement learning system with partitioning function (RLWPF), in which the state space is partitioned into several regions. The internal operating principle of RLWPF is based on a semi-Markov decision process and is general: it can be applied to any reinforcement learning problem with a large state space. In RLWPF, a partitioning module dispatches agents to different regions in order to reduce the state space that each agent must handle. The article proves the convergence of the SARSA algorithm for a semi-Markov decision process, and establishes the convergence of RLWPF by analyzing the equivalence of the value functions of the two semi-Markov decision processes before and after partitioning. It also shows that the optimal policy learned by RLWPF is consistent with prior domain knowledge. As an application, an elevator group control system is designed to reduce the average waiting time of passengers, with four agents each controlling one of four elevator cars. Based on RLWPF, a partitioning module is developed that uses a uniform round trip time as the partitioning criterion, so that the waiting times of most passengers are roughly equal; each elevator car then answers hall calls only in its own region. Compared with conventional elevator systems and with reinforcement learning systems without a partitioning module, the performance results show the advantage of RLWPF.
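To make the SARSA step for a semi-Markov decision process concrete, the following is a minimal sketch and not the article's own formulation. It assumes a tabular action-value function and the common continuous-time discount exp(-beta * tau), where tau is the time elapsed between two decisions. The function name `smdp_sarsa_update`, the parameters `alpha` and `beta`, and the state and action labels are all hypothetical, introduced only for illustration.

```python
import math
from collections import defaultdict


def smdp_sarsa_update(q, s, a, reward, tau, s_next, a_next, alpha=0.1, beta=0.01):
    """Apply one SARSA update for a transition that lasted `tau` time units.

    Assumption: `reward` is the (already accumulated) reward for the whole
    interval, and future value is discounted by exp(-beta * tau).
    """
    discount = math.exp(-beta * tau)  # discount depends on the sojourn time
    target = reward + discount * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Hypothetical usage: states and actions are plain labels here.
q = defaultdict(float)
q[("s1", "up")] = 2.0
new_value = smdp_sarsa_update(
    q, "s0", "stop", reward=-1.0, tau=0.0, s_next="s1", a_next="up", alpha=0.5
)
```

In a partitioned system of the kind the abstract describes, each agent would presumably keep its own table `q` over the states of its region only, which is where the reduction in state space would come from; that mapping is an interpretation, not a detail given in the abstract.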
Funding
Sponsored by the National Natural Science Foundation of China (Grant No. 69975013).