Abstract
Owing to their long-range sequence modeling and parallel computing capabilities, Transformers have achieved significant success in natural language processing and are gradually expanding into computer vision. Starting from image classification, we introduce the architecture of the classic vision Transformer and compare it with convolutional neural networks in terms of connection range, dynamic weights, and position representation ability. We then summarize the key problems and corresponding progress in vision Transformer research across four aspects: computational cost, performance improvement, training optimization, and architecture design. In addition, we propose a general framework for vision Transformers. For object detection and image segmentation, we discuss Transformer-based models and the changes they bring to high-level vision model design in feature learning, result generation, and ground-truth assignment. Finally, we point out future development trends of vision Transformers.
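The comparison above hinges on three properties of self-attention: a global connection range (every token attends to every other token), input-dependent weights (computed dynamically, unlike a CNN's fixed kernels), and explicit position embeddings. A minimal sketch of these ideas, using a toy single-head attention over image patches in NumPy (the pipeline and all names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (num_tokens, dim). The attention weights below are computed
    # from the input itself -- this is the "dynamic weights" property.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (num_tokens, num_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over ALL tokens: global range
    return weights @ v

# Toy ViT-style preprocessing: split an 8x8 "image" into 2x2 patches,
# flatten each patch into a token, then add position embeddings so the
# permutation-invariant attention can recover spatial order.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
patches = img.reshape(4, 2, 4, 2).transpose(0, 2, 1, 3).reshape(16, 4)  # 16 tokens, dim 4
pos = rng.normal(size=(16, 4))    # stand-in for learned position embeddings
tokens = patches + pos
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)                  # (16, 4)
```

In contrast, a convolution would connect each output position only to a fixed local neighborhood with weights shared across positions; here the (16, 16) weight matrix is dense and recomputed per input.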
Authors
TIAN Yong-Lin, WANG Yu-Tong, WANG Jian-Gong, WANG Xiao, WANG Fei-Yue
(Department of Automation, University of Science and Technology of China, Hefei 230027; The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; Qingdao Academy of Intelligent Industries, Qingdao 266000)
Source
Acta Automatica Sinica (《自动化学报》), 2022, No. 4, pp. 957-979 (23 pages)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
Funding
Key-Area Research and Development Program of Guangdong Province (2020B090921003)
Guangzhou Major Science and Technology Project on Intelligent Connected Vehicles (202007050002)
National Natural Science Foundation of China (U1811463)
Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles (ICRI-IACV)