Abstract: A promising approach to learning to play board games is to use reinforcement learning algorithms that can learn a game position evaluation function. In this paper we examine and compare three different methods for generating training games: 1) learning by self-play, 2) learning by playing against an expert program, and 3) learning from viewing experts play against each other. Although the third option generates high-quality games from the start, compared to the initially random games generated by self-play, its drawback is that the learning program is never allowed to test the moves it prefers. Since our expert program uses an evaluation function similar to that of the learning program, we also examine whether it is helpful to learn directly from the board evaluations given by the expert. We compare these methods using temporal difference learning with neural networks on the game of backgammon.
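As a rough illustration of the kind of temporal difference update such a setup relies on, the sketch below trains a position-evaluation network on one game's sequence of positions; the network and game interfaces are assumptions for illustration, not the paper's code.

```python
import torch

# Minimal TD(0) sketch, assuming value_net maps a board tensor to a
# single scalar evaluation and positions are the boards of one game in order.
def td_update(value_net, optimizer, positions, final_reward, gamma=1.0):
    for t in range(len(positions) - 1):
        v_t = value_net(positions[t])
        with torch.no_grad():
            v_next = value_net(positions[t + 1])
        loss = (v_t - gamma * v_next) ** 2      # move v_t toward the next evaluation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # At the end of the game the target is the actual outcome (e.g. win/loss).
    v_last = value_net(positions[-1])
    loss = (v_last - final_reward) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same update applies whether the positions come from self-play, from games against an expert program, or from observed expert-versus-expert games; only the source of the position sequence changes.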
Abstract: Autonomous mobile robots must be flexible enough to learn new, complex control behaviours in order to adapt effectively to a dynamic and changing environment. The approach proposed in this paper is to create a controller that learns complex behaviours by incorporating learning from demonstration, which reduces the search space, and by improving the demonstrated task geometry through trial and correction. The task faced by the robot involves uncertainty that must be learned. Simulation results indicate that after a handful of trials the robot has learned the right policies, avoiding the obstacles and reaching the goal.
Funding: Supported by the National Key Research and Development Program of China (No. 2018AAA0103005) and the National Natural Science Foundation of China (No. 61873266).
Abstract: In this paper, an efficient skill learning framework is proposed for robotic insertion, based on one-shot demonstration and reinforcement learning. First, the robot action is composed of two parts: an expert action and a refinement action. A force Jacobian matrix is calibrated with only one demonstration, based on which a stable and safe expert action can be generated. The deep deterministic policy gradient (DDPG) method is employed to learn the refinement action, which aims to improve assembly efficiency. Second, an episode-step exploration strategy is developed, which uses the expert action as a benchmark and adjusts the exploration intensity dynamically, and a safety-efficiency reward function is designed for compliant insertion. Third, to improve adaptability to different components, a skill saving and selection mechanism is proposed: several typical components are used to train skill models, and the trained models and force Jacobian matrices are saved in a skill pool. Given a new component, the most appropriate model is selected from the skill pool according to the force Jacobian matrix and used directly to accomplish the insertion task. Fourth, a simulation environment is established under the guidance of the force Jacobian matrix, which avoids a tedious training process on real robotic systems. Simulations and experiments are conducted to validate the effectiveness of the proposed methods.
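A minimal sketch of how a Jacobian-based expert action could be combined with a learned refinement action is shown below; the function names, the gain, and the shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Sketch: the commanded action is the sum of a compliant expert action derived
# from the calibrated force Jacobian and a refinement action from the DDPG actor.
def expert_action(force_jacobian: np.ndarray, measured_force: np.ndarray,
                  gain: float = 0.1) -> np.ndarray:
    """Map the sensed contact force to a corrective, compliance-like motion."""
    return -gain * force_jacobian @ measured_force

def composed_action(force_jacobian, measured_force, refinement_policy, state):
    """Expert action gives a safe baseline; the learned policy adds a refinement."""
    a_expert = expert_action(force_jacobian, measured_force)
    a_refine = refinement_policy(state)   # e.g. the DDPG actor network
    return a_expert + a_refine
```

Keeping the expert term as a benchmark also makes it natural to scale the exploration noise relative to it, in the spirit of the episode-step exploration strategy described above.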
Funding: Supported by the National Natural Science Foundation of China (Grant No. 91848202) and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (Grant No. 51521003).
Abstract: Learning from demonstration (LfD) is an appealing method for helping robots learn new skills. Numerous papers have presented LfD methods with good performance in robotics. However, complicated robot tasks that require carefully regulated path planning strategies remain an open problem. Contact and non-contact constraints in specific robot tasks make the path planning problem more difficult, as the interaction between the robot and the environment is time-varying. In this paper, we focus on the path planning of complex robot tasks in the domain of LfD and give a novel perspective for classifying imitation learning and inverse reinforcement learning. This classification is based on constraints and obstacle avoidance. Finally, we summarize these methods and present promising directions for robot applications and LfD theory.
Funding: This work was supported by the National Key R&D Plan (2016YFB0100901), the National Natural Science Foundation of China (Grant Nos. U20B2062 and 61673237), and the Beijing Municipal Science & Technology Project (Z191100007419001).
Abstract: In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated value estimates and suboptimal policies. In this paper, we address the problem by performing advantage rectification with imperfect demonstrations, thus reducing the function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate the learning process of deep reinforcement learning when simulations are expensive to obtain. However, existing methods, such as behavior cloning, often assume that the demonstrations carry additional information or labels about their quality, such as optimality, an assumption that is usually incorrect and unhelpful in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method called learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations, derived from the minimal assumption that the demonstrating policies perform better than our current policy. LIDAR learns from contradictions caused by estimation errors and in turn reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC, and experiments show that our method measurably reduces the function estimation errors, effectively leverages demonstrations far from the optimum, and consistently outperforms state-of-the-art baselines in all scenarios.
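The sketch below illustrates the general idea of an advantage-rectified demonstration loss: imitate a demonstrated action only when the critic judges it better than the current policy's action. The exact loss, names, and shapes are assumptions for illustration, not the published LIDAR objective.

```python
import torch

# Sketch: actor(s) returns actions, critic(s, a) is assumed to return a 1-D
# tensor of Q-values for a batch of state-action pairs.
def rectified_demo_loss(actor, critic, demo_states, demo_actions):
    policy_actions = actor(demo_states)
    # Advantage of the demonstrated action over the current policy's action.
    advantage = (critic(demo_states, demo_actions)
                 - critic(demo_states, policy_actions)).squeeze(-1)
    mask = (advantage > 0).float().detach()              # keep only "better" demos
    imitation = ((policy_actions - demo_actions) ** 2).mean(dim=-1)
    return (mask * imitation).mean()
```

Because the mask depends on the critic, demonstrations that contradict an overestimated value function exert a corrective pull on the actor, which is consistent with the abstract's claim that the method learns from contradictions caused by estimation errors.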
Abstract: The main idea of reinforcement learning is to evaluate the chosen action depending on the current reward. Following this concept, many algorithms have achieved good performance on classic Atari 2600 games. The main challenge arises when the reward is sparse or missing, as in hard-exploration environments like the Montezuma's Revenge, Pitfall, and Private Eye games. Approaches built to deal with such challenges have been very demanding. This work introduces a different reward system that enables a simple classical algorithm to learn quickly and achieve high performance in hard-exploration environments. Moreover, we add some simple enhancements to several hyperparameters, such as the number of actions and the sampling ratio, that help improve performance. We include the extra reward within the human demonstrations and then use Prioritized Double Deep Q-Networks (Prioritized DDQN) to learn from these demonstrations. Our approach enabled Prioritized DDQN, with a short learning time, to finish the first level of Montezuma's Revenge and to perform well in both Pitfall and Private Eye. We used the same games to compare our results with several baselines, such as the Rainbow and Deep Q-learning from Demonstrations (DQfD) algorithms. The results showed that the new reward system enabled Prioritized DDQN to outperform the baselines in the hard-exploration games with a short learning time.
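A minimal sketch of the idea of injecting an extra reward into human demonstration transitions before Prioritized DDQN training is given below; the bonus value, the priority handling, and the replay-buffer interface are hypothetical, not the paper's exact scheme.

```python
# Sketch: demonstration transitions receive a small bonus reward and a fixed
# priority before being loaded into a prioritized replay buffer (assumed API).
DEMO_BONUS = 0.1  # hypothetical extra reward for demonstrated transitions

def load_demonstrations(demo_transitions, replay_buffer, demo_priority=1.0):
    for state, action, reward, next_state, done in demo_transitions:
        shaped_reward = reward + DEMO_BONUS        # extra reward inside the demos
        replay_buffer.add(state, action, shaped_reward, next_state, done,
                          priority=demo_priority)
```

Once the shaped demonstrations are in the buffer, training proceeds with the standard Prioritized DDQN update, mixing demonstration and self-generated transitions according to the sampling ratio mentioned in the abstract.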