Abstract
Deep reinforcement learning methods used to teach manipulators reaching skills commonly suffer from low sample efficiency and poor generalization. To address this, a skill learning method based on meta-Q-learning is proposed. First, deep deterministic policy gradient (DDPG) combined with hindsight experience replay (HER) is used to train a manipulator to reach a target point with a specified pose, which verifies the effectiveness of the algorithm on reaching tasks. Second, a multi-task objective is constructed over a set of related tasks and used as the optimization objective; DDPG combined with HER is then used to train the model, yielding a meta-trained model with strong generalization together with the meta-training data. A gated recurrent unit (GRU) is also used to extract a context variable from each trajectory. Finally, the model is first trained briefly on the new task and then trained further with the meta-training data to improve performance. Simulation experiments show that meta-Q-learning brings clear improvements in initial performance, learning speed, and converged performance: the number of samples required to reach the desired performance is reduced by 77%, and the average success rate is increased by 15%.
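The abstract does not say which HER relabeling strategy the authors use, so the following is only a minimal sketch of the general idea, assuming the simplest "final" strategy: a failed reaching episode is replayed as if the pose actually reached at the end had been the goal. The names `Transition`, `sparse_reward`, `relabel_with_final_goal`, and the tolerance `tol` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from typing import List, NamedTuple, Tuple

Vec = Tuple[float, ...]


class Transition(NamedTuple):
    """One step of a goal-conditioned episode (hypothetical layout)."""
    state: Vec
    action: Vec
    achieved_goal: Vec  # end-effector position reached after the action
    goal: Vec           # goal the policy was conditioned on
    reward: float


def sparse_reward(achieved: Vec, goal: Vec, tol: float = 0.05) -> float:
    """Return 0.0 within `tol` of the goal and -1.0 otherwise (assumed reward)."""
    return 0.0 if math.dist(achieved, goal) <= tol else -1.0


def relabel_with_final_goal(episode: List[Transition]) -> List[Transition]:
    """Copy the episode, replacing the goal by the finally achieved position.

    Rewards are recomputed against the substituted goal, so the last
    transition always counts as a success.
    """
    if not episode:
        return []
    new_goal = episode[-1].achieved_goal
    return [
        t._replace(goal=new_goal,
                   reward=sparse_reward(t.achieved_goal, new_goal))
        for t in episode
    ]
```

In an off-policy learner such as DDPG, both the original and the relabeled transitions would typically be stored in the replay buffer, so that sparse-reward episodes still yield some successful examples.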
Authors
LI Maojie (李茂捷)
XU Guozheng (徐国政)
GAO Xiang (高翔)
TAN Caiming (谭彩铭)
Affiliations: College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Robotics Information Sensing and Control Institute, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Source
Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition)
Peking University Core Journal
2023, No. 1, pp. 96-103 (8 pages)
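Funding
Natural Science Foundation of Jiangsu Province (BK20210599)
Natural Science Research Project of Higher Education Institutions of Jiangsu Province (20KJB510023)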
Keywords
robot learning
meta reinforcement learning
deep deterministic policy gradient (DDPG)
meta-Q-learning
sample efficiency