
Deep Double Q-Network Based on Linear Dynamic Frame Skip
(基于线性动态跳帧的深度双Q网络)
Cited by: 2
Abstract: Deep Q-Network is able to perform human-level control on tasks that require both rich perception of high-dimensional raw inputs and policy control. However, state-of-the-art architectures such as Deep Q-Network and its improved variants adopt a static frame skip rate: the action output by the network is repeated for a fixed number of frames regardless of the current state. Although Dynamic Frame Skip Deep Q-Network uses a dynamic frame skip rate, it doubles the number of nodes in the network output layer with frame skip rates of 4 or 20; such a setting increases the amount of computation in the network and may cause bad actions to be repeated many times, which hurts learning efficiency. In addition, an important technique in Deep Q-Network is experience replay. Uniform sampling ignores the relative importance of samples; to increase the sampling rate of important samples, prioritized experience replay improves on uniform sampling, but it uses only the temporal-difference error of a sample as the priority criterion, while other factors may also affect a sample's priority. In this paper, we propose a new algorithm: Deep Double Q-Network based on Linear Dynamic Frame Skip and Improved Prioritized Experience Replay (LDF-IPER-DDQN). The frame skip rate increases linearly with the magnitude of the network's output Q-value, which allows the agent to dynamically select the number of times an action is repeated based on the current state and action: the action with the largest Q-value is given the maximum frame skip rate, while the action with the smallest Q-value is given the minimum frame skip rate. In this way, the frame skip rate becomes a dynamically learnable parameter. Furthermore, the frame skip rate of each action stored in the experience pool is taken as an additional factor in evaluating the sampling priority, so the priority of a sample is determined jointly by its frame skip rate and its temporal-difference error. As a consequence, if two transition samples have similar temporal-difference errors, the one with the larger frame skip rate is replayed more frequently. We evaluate the new algorithm during both the training and testing phases on eight challenging Atari 2600 games with sparse rewards: Seaquest, Assault, Asterix, Q*bert, SpaceInvaders, Berzerk, BeamRider and Gopher. The experimental results show that LDF-IPER-DDQN outperforms traditional dynamic frame skip and prioritized experience replay algorithms on these games, achieving a higher average reward per episode and better stability.
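The record above gives no equations or pseudocode, so the following is a minimal illustrative sketch of the two mechanisms described in the abstract: a linear mapping from the selected action's Q-value onto a frame-skip range, and a sampling priority that combines the temporal-difference error with the stored frame-skip rate. The range [k_min, k_max], the mixing weight beta, the exponent alpha, and all function names are assumptions for illustration, not the authors' exact formulation.

import numpy as np

def linear_frame_skip(q_values, action, k_min=1, k_max=20):
    # Map the chosen action's Q-value linearly onto [k_min, k_max]:
    # the action with the largest Q-value gets the maximum skip rate,
    # the action with the smallest Q-value gets the minimum
    # (k_min and k_max are assumed values, not from the paper).
    q = np.asarray(q_values, dtype=np.float64)
    q_lo, q_hi = q.min(), q.max()
    if np.isclose(q_hi, q_lo):          # all actions valued equally
        return k_min
    frac = (q[action] - q_lo) / (q_hi - q_lo)
    return int(round(k_min + frac * (k_max - k_min)))

def priority(td_error, frame_skip, k_max=20, alpha=0.6, beta=0.5, eps=1e-6):
    # Combine |TD error| with the normalized frame-skip rate so that,
    # at similar TD errors, transitions with larger skip rates are
    # replayed more often (this additive mixing rule is an assumption).
    return (abs(td_error) + beta * frame_skip / k_max + eps) ** alpha

# Usage: three actions, the greedy action (index 2) receives the largest skip rate.
q_vals = [0.1, 0.5, 0.9]
a = int(np.argmax(q_vals))
k = linear_frame_skip(q_vals, a)            # -> 20 under the assumed range
p = priority(td_error=0.3, frame_skip=k)    # higher than for a low-skip sample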
Authors: CHEN Song (陈松), ZHANG Xiao-Fang (章晓芳), ZHANG Zong-Zhang (章宗长), LIU Quan (刘全), WU Jin-Jin (吴金金), YAN Yan (闫岩). Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012
Source: Chinese Journal of Computers (《计算机学报》; EI, CSCD, Peking University Core Journal), 2019, No. 11, pp. 2561-2573 (13 pages)
Funding: National Natural Science Foundation of China (61472262, 61502329, 61772355, 61876119); Natural Science Foundation of Jiangsu Province (BK20181432); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18); Suzhou Key Industry Technology Innovation Prospective Application Research Project (SYG201807)
Keywords: deep reinforcement learning; Deep Q-Network; dynamic frame skip; prioritized experience replay