Funding: supported by the National Natural Science Foundation of China (61303108), the Suzhou Key Industries Technological Innovation-Prospective Applied Research Project (SYG201804), a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities, JLU (93K172020K25).
Abstract: In reinforcement learning, an agent may explore ineffectively when dealing with sparse-reward tasks, where finding a reward point is difficult. To solve this problem, we propose an algorithm called hierarchical deep reinforcement learning with automatic sub-goal identification via computer vision (HADS), which takes advantage of hierarchical reinforcement learning to alleviate the sparse reward problem and improves the efficiency of exploration through a sub-goal mechanism. HADS uses a computer vision method to identify sub-goals automatically for hierarchical deep reinforcement learning. Because not all sub-goal points are reachable, a mechanism is proposed to remove unreachable sub-goal points and further improve the performance of the algorithm. HADS applies contour recognition to identify sub-goals from the state image: some salient states in the image may be recognized as sub-goals, while those that are not are removed based on prior knowledge. Our experiments verify the effectiveness of the algorithm.
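The abstract does not give implementation details for the sub-goal identification step. The following is a minimal sketch, assuming OpenCV is used for the contour recognition: salient contours in a grayscale state image are reduced to centroid coordinates, and candidates are then filtered by a reachability predicate standing in for the prior knowledge the abstract mentions. The function names, the min_area threshold, and the is_reachable predicate are illustrative assumptions, not part of HADS.

```python
# Illustrative sketch (not the authors' code): extract sub-goal candidates
# from a grayscale state image with contour recognition, then filter out
# candidates judged unreachable by a user-supplied predicate.
import cv2
import numpy as np

def find_subgoal_candidates(state_image, min_area=20):
    """Return centroids of salient contours in a grayscale state image."""
    # Binarize so that salient regions stand out from the background.
    _, binary = cv2.threshold(state_image, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    candidates = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # ignore tiny regions (likely noise)
        m = cv2.moments(contour)
        if m["m00"] == 0:
            continue
        cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
        candidates.append((cx, cy))
    return candidates

def filter_reachable(candidates, is_reachable):
    """Keep only sub-goal points judged reachable by prior knowledge."""
    return [p for p in candidates if is_reachable(p)]

# Example usage with a dummy 84x84 state image and a trivial reachability test.
if __name__ == "__main__":
    state = np.zeros((84, 84), dtype=np.uint8)
    cv2.rectangle(state, (10, 10), (20, 20), 255, -1)  # one salient region
    subgoals = filter_reachable(find_subgoal_candidates(state),
                                is_reachable=lambda p: True)
    print(subgoals)
```

In this reading, contour centroids serve as candidate sub-goal coordinates, and the reachability test is where environment-specific prior knowledge would be encoded.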
Abstract: Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can extract more useful information from the empirical data and therefore improve data efficiency. However, most existing online least-squares policy iteration methods use each sample only once, resulting in a low utilization rate. With the goal of improving utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. ERLSPI combines the online least-squares policy iteration method with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark. The experimental results show that the method can effectively take advantage of previous experience and knowledge, improve the sample utilization efficiency, and accelerate the convergence speed.
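The abstract does not reproduce the update equations, so the sketch below is only a rough illustration of the general idea: least-squares policy evaluation (LSTD-Q style) applied repeatedly over an experience-replay buffer, with a linear Q-function Q(s, a) = w·φ(s, a) and greedy policy improvement. The feature map, the toy chain task, the regularization term, and all names are hypothetical stand-ins, not the ERLSPI formulation from the paper (which uses an inverted pendulum benchmark).

```python
# Illustrative sketch (not the authors' implementation): least-squares policy
# iteration over an experience-replay buffer with a linear Q-function.
import numpy as np

def lstdq(replay_buffer, phi, policy, n_features, gamma=0.95, reg=1e-3):
    """One least-squares policy-evaluation step over the whole replay buffer."""
    A = reg * np.eye(n_features)            # regularization keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next in replay_buffer:   # reuse every stored sample
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)            # weights of the evaluated policy

def greedy_policy(w, phi, actions):
    """Improve the policy: act greedily with respect to the current weights."""
    return lambda s: max(actions, key=lambda a: w @ phi(s, a))

# Hypothetical 1-D chain task with radial-basis features, one block per action.
actions = [-1, +1]
centers = np.linspace(0.0, 1.0, 5)

def phi(s, a):
    rbf = np.exp(-((s - centers) ** 2) / 0.05)
    feat = np.zeros(len(centers) * len(actions))
    idx = actions.index(a)
    feat[idx * len(centers):(idx + 1) * len(centers)] = rbf
    return feat

# Collect transitions online and store them in the replay buffer.
rng = np.random.default_rng(0)
replay, s = [], 0.5
for _ in range(500):
    a = rng.choice(actions)
    s_next = np.clip(s + 0.05 * a + 0.01 * rng.standard_normal(), 0.0, 1.0)
    r = 1.0 if s_next > 0.9 else 0.0        # sparse reward near the right end
    replay.append((s, a, r, s_next))
    s = s_next

# Policy iteration: alternate least-squares evaluation and greedy improvement,
# reusing the same stored samples in every iteration.
policy = lambda s: rng.choice(actions)
for _ in range(5):
    w = lstdq(replay, phi, policy, n_features=10)
    policy = greedy_policy(w, phi, actions)
print(w)
```

The point of the replay buffer here is that every stored transition enters every least-squares solve, rather than being consumed once and discarded as in purely online updates.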