Twice Sampling Method in Deep Q-network (深度Q学习的二次主动采样方法)

Cited by: 16
Abstract  One way of implementing deep Q-learning is the deep Q-network (DQN). Experience replay trains a DQN by reusing transitions stored in a replay memory. However, building the replay memory requires the agent to interact with the environment many times, which increases both cost and risk. An effective way to reduce the number of interactions is to use the stored transitions more efficiently. The cumulative reward of the episode from which a transition was collected affects DQN training: compared with transitions from low-return episodes, transitions from high-return episodes accelerate the convergence of the DQN and lead to better policies. This paper proposes a twice active sampling method for deep Q-learning. First, episodes are sampled from the replay memory with priorities built from the distribution of their cumulative rewards. Then, transitions are sampled from the selected episodes with priorities built from the distribution of their temporal-difference errors (TD-errors). Finally, the DQN is trained on the transitions obtained by the two sampling stages. Because transitions are selected on the basis of both cumulative reward and TD-error, the method accelerates the convergence of the DQN and improves the quality of the learned policy. Experiments on the Atari platform show that training a DQN with transitions obtained by twice active sampling achieves good results.
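The abstract describes the algorithm only in prose. As a rough illustration, the following minimal Python sketch shows one plausible reading of the two sampling stages, with priorities shaped as in prioritized experience replay, P(i) = p_i^α / Σ_k p_k^α. Everything here is an assumption for illustration (the `episodes` layout, the names `to_probs` and `twice_sample`, and the shift used to keep negative episode returns positive); it is not the authors' implementation.

```python
import numpy as np

def to_probs(scores, alpha=0.6, eps=1e-6):
    """Map raw priority scores to sampling probabilities.

    Shifts scores to be strictly positive (episode returns may be
    negative), then sharpens with exponent alpha in the style of
    prioritized experience replay: P(i) = p_i**alpha / sum_k(p_k**alpha).
    """
    scores = np.asarray(scores, dtype=np.float64)
    p = (scores - scores.min() + eps) ** alpha
    return p / p.sum()

def twice_sample(episodes, batch_size, alpha=0.6, rng=None):
    """Two-stage ("twice") active sampling sketch.

    `episodes` is assumed to be a list of dicts with keys:
      'return'      -- the episode's cumulative reward
      'transitions' -- a list of transitions, each carrying a 'td_error'
    """
    rng = rng or np.random.default_rng()

    # Stage 1: sample episode indices with priorities built from the
    # distribution of cumulative episode returns.
    ep_probs = to_probs([ep['return'] for ep in episodes], alpha)
    ep_idx = rng.choice(len(episodes), size=batch_size, p=ep_probs)

    batch = []
    for i in ep_idx:
        trans = episodes[i]['transitions']
        # Stage 2: within the chosen episode, sample one transition with
        # priorities built from the distribution of absolute TD-errors.
        t_probs = to_probs([abs(t['td_error']) for t in trans], alpha)
        batch.append(trans[rng.choice(len(trans), p=t_probs)])
    return batch  # minibatch for the next DQN gradient step
```

In practice the stored TD-errors would need to be refreshed as the network changes, and prioritized replay normally pairs non-uniform sampling with importance-sampling corrections to keep the update unbiased; consult the paper for the exact scheme.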
Authors  ZHAO Ying-Nan, LIU Peng, ZHAO Wei, TANG Xiang-Long (Pattern Recognition and Intelligent System Research Center, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)
Source  Acta Automatica Sinica (《自动化学报》), 2019, Issue 10: 1870-1882 (13 pages). Indexed: EI, CSCD, PKU Core.
Funding  Supported by the National Natural Science Foundation of China (61671175, 61672190)
Keywords  Prioritized experience replay, temporal-difference error (TD-error), deep Q-networks (DQN), cumulative reward
