
一种带探索噪音的深度循环Q网络 (Cited by: 11)

A Deep Recurrent Q Network with Exploratory Noise
Abstract: The deep Q network, which combines deep neural networks with reinforcement learning, has achieved great success on the Atari 2600 game platform. Compared with the deep Q network, the deep recurrent Q network can memorize historical information and shows better performance in some strategic tasks and games with delayed rewards. However, on the one hand, the traditional dithering policy in the action space cannot make reasonable decisions; on the other hand, in some complex game environments, the deep recurrent Q network requires a large amount of training time because of its complex structure and activation functions, which increases the difficulty of the task. To address these problems, a deep recurrent Q network with exploratory noise (EN-DRQN) is proposed. Unlike exploration in the action space, EN-DRQN injects noise directly into the network space, which changes the output of the network, and the agent adjusts its policy according to this change. Exploration in the network space can cause complex changes over multiple future time steps, and the agent memorizes these multistep changes through the recurrent neural network, making its decisions more strategic. EN-DRQN has the following characteristics. First, exploratory noise is used to perform deep exploration and compensate for the inefficiency of traditional policy exploration. The noise is drawn from a noise distribution and exploration is driven by its variance: if the selected action is the optimal action, the agent reduces the noise scale to lower the variance of the network; if it is not, the agent increases the variance to raise the probability of selecting the optimal action. The variables that control the noise scale are trained by gradient descent together with the other weight parameters of the network. This noise policy in the network space improves the chance of discovering new states, and mining new states provides more abundant samples for the agent during learning as well as effective information for decision-making. Second, improved double-layer gated recurrent units are used to memorize historical information over long time steps, which enables the agent to make reasonable decisions under delayed rewards. In the deep recurrent Q network, a single-layer LSTM network has a relatively limited ability to memorize historical information and cannot achieve satisfactory performance in some strategic environments, while the LSTM unit also increases the demand for computing resources. Finally, the effectiveness of EN-DRQN is tested on eight games, including AirRaid, BeamRider, Centipede, and Freeway, all of which are strategic and have delayed rewards. The models are compared by the average score of each game. The experimental results show that, compared with the dithering policy in the action space, EN-DRQN achieves better performance on some strategy games and games with delayed rewards on the Atari 2600 platform.
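The abstract describes the two mechanisms only at a high level, so the following is a minimal PyTorch sketch of the general idea rather than the paper's actual EN-DRQN implementation. It assumes a NoisyNet-style linear layer with independent Gaussian noise whose per-weight scales (sigma) are ordinary parameters trained by gradient descent, stacked on top of a two-layer GRU that carries history across time steps. The class names NoisyLinear and RecurrentQNet, the layer sizes, and the Atari-style convolutional encoder are illustrative choices, not details taken from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Module):
    """Linear layer whose weights are perturbed by learnable, scaled Gaussian noise.

    Both the means (mu) and the noise scales (sigma) are ordinary parameters and are
    updated by gradient descent together with the rest of the network, so the agent
    can enlarge or shrink its own exploration variance during training.
    """

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)
        nn.init.uniform_(self.bias_mu, -bound, bound)

    def forward(self, x):
        if self.training:
            # Sample fresh noise on every forward pass: exploration in weight space.
            weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
            bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        else:
            # Act greedily with respect to the mean weights at evaluation time.
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)


class RecurrentQNet(nn.Module):
    """Convolutional encoder + two-layer GRU + noisy Q-value head (illustrative sketch)."""

    def __init__(self, num_actions, hidden_size=256):
        super().__init__()
        # Atari-style feature extractor for 84x84 single-channel frames.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Two stacked GRU layers memorize history over long sequences of frames.
        self.gru = nn.GRU(64 * 7 * 7, hidden_size, num_layers=2, batch_first=True)
        self.q_head = NoisyLinear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -> Q-values: (batch, time, num_actions)
        b, t = frames.shape[:2]
        feats = self.conv(frames.view(b * t, *frames.shape[2:])).view(b, t, -1)
        out, hidden = self.gru(feats, hidden)
        return self.q_head(out), hidden


# Minimal usage check with a dummy 4-step episode fragment.
net = RecurrentQNet(num_actions=6)
q_values, h = net(torch.zeros(1, 4, 1, 84, 84))
```

In this sketch, exploration happens in weight space rather than through epsilon-greedy dithering in the action space: a fresh noise sample perturbs the Q head on every training forward pass, while evaluation uses the mean weights. The sigma parameters receive gradients from the same TD loss as the other weights, which is how the noise scale can grow or shrink as learning progresses.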
Authors: 刘全 (LIU Quan), 闫岩 (YAN Yan), 朱斐 (ZHU Fei), 吴文 (WU Wen), 张琳琳 (ZHANG Lin-Lin)
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006
Source: Chinese Journal of Computers (《计算机学报》, EI / CSCD / Peking University Core indexed), 2019, Issue 7, pp. 1588-1604 (17 pages)
Funding: National Natural Science Foundation of China (61772355, 61472262); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Applied Basic Research Program, Industry Part (SYG201422, SYG201804); Jiangsu Provincial Key Laboratory Project, Soochow University (KJS1524)
Keywords: deep learning; reinforcement learning; recurrent neural network; convolutional neural network; exploratory noise