
Asynchronous Advantage Actor-Critic with Double Attention Mechanisms

Cited by: 4
Abstract: Deep reinforcement learning is currently one of the fastest-developing techniques in machine learning. When traditional deep reinforcement learning methods handle high-dimensional, large-state-space tasks, their enormous computational cost leads to excessively long training times. Although asynchronous deep reinforcement learning greatly shortens training time through asynchronous methods, it tends to ignore image regions and image features that carry more value. To address this problem, this paper proposes an asynchronous advantage actor-critic algorithm with double attention mechanisms. The new algorithm uses a feature attention mechanism and a visual attention mechanism to improve the traditional asynchronous deep reinforcement learning model. The feature attention mechanism assigns different weights to all the feature maps produced by the convolutional layers of a convolutional neural network, so that the agent focuses on important image features; meanwhile, the visual attention mechanism assigns weight parameters to different regions of the image, where a high-weight region indicates that its information is of great value to the agent's subsequent policy learning, helping the agent learn the optimal policy more efficiently. By introducing the double attention mechanism, the new algorithm encodes and represents the image from both a shallow and a deep perspective, helping the agent concentrate on important image regions and image features. Finally, experiments on several classic Atari 2600 games verify the effectiveness of the asynchronous advantage actor-critic algorithm with double attention mechanisms.

In recent years, deep reinforcement learning (DRL), which combines deep learning and reinforcement learning, has become a new research hotspot in artificial intelligence. Because DRL takes advantage of deep learning, it can take raw images as input, which extends the range of applications of reinforcement learning. Meanwhile, DRL retains the advantages of reinforcement learning in applications such as intelligent decision making and robotic control. However, traditional DRL methods such as the deep Q-network (DQN) or the double deep Q-network (DDQN) can hardly deal with complex, high-dimensional-state tasks in a short time. Researchers have proposed many methods to solve this problem, and asynchronous advantage actor-critic (A3C) is one of the most widely used. Traditional asynchronous deep reinforcement learning uses multi-threading to cut training time substantially. However, in high-dimensional, large-state-space tasks such as Atari 2600 games, valuable and important image areas and features are often ignored. The reason is that the agent's attention is spread over the entire input image and all of its features, without any emphasis on the more important ones. To handle this problem, we employ attention mechanisms to improve the performance of traditional asynchronous deep reinforcement learning models. In recent years, inspired by human vision, the attention mechanism has been used extensively in machine translation, image recognition and speech recognition, becoming one of the most noteworthy and intensively studied techniques in deep learning. Based on this, we put forward an asynchronous advantage actor-critic with double attention mechanisms (DAM-A3C). DAM-A3C has two main components: a visual attention mechanism (VAM) and a feature attention mechanism (FAM). First, the visual attention mechanism enables the agent to adaptively attend to image regions, especially those that increase the cumulative reward at each moment, which reduces the computational cost of training the network and ultimately accelerates learning of an approximately optimal policy. Second, through FAM, the asynchronous advantage actor-critic pays more attention to the features with greater value. In a convolutional neural network, different convolution kernels generate different feature maps when convolved with the image, and together these feature maps describe the image in terms of different features. Traditional training of a convolutional neural network treats every extracted feature equally, giving all features the same weight instead of different levels of focus according to their value. However, some image features, such as color, shape and spatial-relationship features, play a crucial role in describing an image. To alleviate this problem, FAM helps the agent concentrate on feature maps with rich value, which in turn helps the agent make correct decisions. To sum up, we introduce FAM into the VAM-A3C model and propose the DAM-A3C model. DAM-A3C utilizes the visual attention mechanism and the feature attention mechanism to let the agent concentrate on the important areas and important features of the image, which helps the network model recognize important information and key features of the image in a short time. We select several classic Atari 2600 games as experimental tasks to evaluate the performance of the new model. The experimental results show that the new model performs better than the traditional asynchronous advantage actor-critic algorithm on these tasks.
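The abstract describes two attention modules: a feature attention mechanism (FAM) that weights the feature maps produced by the convolutional layers, and a visual attention mechanism (VAM) that weights spatial regions of the image. The sketch below is a minimal PyTorch illustration of how such channel-wise and spatial gating could be wired into an Atari-style actor-critic network; the class name DAMA3CNet, the layer sizes, and the sigmoid/softmax gating functions are illustrative assumptions, not the architecture published in the paper.

```python
# Minimal sketch of an A3C-style network with feature (channel) attention and
# visual (spatial) attention, as described in the abstract. Layer sizes and
# gating functions are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAMA3CNet(nn.Module):
    def __init__(self, in_channels: int = 4, num_actions: int = 6):
        super().__init__()
        # Standard Atari-style convolutional encoder (assumed sizes).
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # FAM: one scalar weight per feature map, from pooled channel statistics.
        self.fam_fc = nn.Linear(64, 64)
        # VAM: one weight per spatial location, from a 1x1 convolution.
        self.vam_conv = nn.Conv2d(64, 1, kernel_size=1)

        self.fc = nn.Linear(64 * 7 * 7, 512)             # 7x7 assumes 84x84 input
        self.policy_head = nn.Linear(512, num_actions)   # actor (action logits)
        self.value_head = nn.Linear(512, 1)              # critic (state value)

    def forward(self, x: torch.Tensor):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))                        # (B, 64, 7, 7)

        # FAM: re-weight each feature map by its learned importance.
        chan = h.mean(dim=(2, 3))                        # (B, 64) global average pool
        chan_w = torch.sigmoid(self.fam_fc(chan)).unsqueeze(-1).unsqueeze(-1)
        h = h * chan_w

        # VAM: re-weight each spatial region by its learned importance.
        spat = self.vam_conv(h)                          # (B, 1, 7, 7)
        spat_w = torch.softmax(spat.flatten(2), dim=-1).view_as(spat)
        h = h * spat_w

        h = F.relu(self.fc(h.flatten(1)))
        return self.policy_head(h), self.value_head(h)


# Example forward pass on a batch of two stacked 84x84 Atari frames.
net = DAMA3CNet(in_channels=4, num_actions=6)
obs = torch.zeros(2, 4, 84, 84)
logits, value = net(obs)          # logits: (2, 6), value: (2, 1)
```

In an A3C-style setup, each worker thread would hold a copy of such a network, compute policy and value losses from its rollouts, and asynchronously apply gradients to a shared global model; the attention weights are learned end to end together with the rest of the parameters.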
Authors: 凌兴宏 (LING Xing-Hong), 李杰 (LI Jie), 朱斐 (ZHU Fei), 刘全 (LIU Quan), 伏玉琛 (FU Yu-Chen)
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000; School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI and CSCD, Peking University Core Journal list, 2020, Issue 1, pp. 93-106 (14 pages)
Funding: National Natural Science Foundation of China (61772355, 61303108, 61373094); Major Program of Natural Science Research of Jiangsu Higher Education Institutions (17KJA520004); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04); Suzhou Applied Basic Research Program, Industrial Part (SYG201422); Suzhou Science and Technology Program for People's Livelihood (SS201736); Priority Academic Program Development of Jiangsu Higher Education Institutions
Keywords: attention mechanism; double attention mechanisms; actor-critic; asynchronous advantage actor-critic; asynchronous deep reinforcement learning
