Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks(CNNs).The transformers demonstrate the ability to model long-range dependencies by ut...Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks(CNNs).The transformers demonstrate the ability to model long-range dependencies by utilizing a self-attention mechanism.This study aims to provide a comprehensive survey of recent transformerbased approaches in image and video applications,as well as diffusion models.We begin by discussing existing surveys of vision transformers and comparing them to this work.Then,we review the main components of a vanilla transformer network,including the self-attention mechanism,feed-forward network,position encoding,etc.In the main part of this survey,we review recent transformer-based models in three categories:Transformer for downstream tasks,Vision Transformer for Generation,and Vision Transformer for Segmentation.We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models.We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets.Finally,we explore some future research directions to further improve the field.展开更多
The development of autonomous vehicles has become one of the greatest research endeavors in recent years. These vehicles rely on many complex systems working in tandem to make decisions. For practical use and safety r...The development of autonomous vehicles has become one of the greatest research endeavors in recent years. These vehicles rely on many complex systems working in tandem to make decisions. For practical use and safety reasons, these systems must not only be accurate, but also quickly detect changes in the surrounding environment. In autonomous vehicle research, the environment perception system is one of the key components of development. Environment perception systems allow the vehicle to understand its surroundings. This is done by using cameras, light detection and ranging (LiDAR), with other sensor systems and modalities. Deep learning computer vision algorithms have been shown to be the strongest tool for translating camera data into accurate and safe traversability decisions regarding the environment surrounding a vehicle. In order for a vehicle to safely traverse an area in real time, these computer vision algorithms must be accurate and have low latency. While much research has studied autonomous driving for traversing well-structured urban environments, limited research exists evaluating perception system improvements in off-road settings. This research aims to investigate the adaptability of several existing deep-learning architectures for semantic segmentation in off-road environments. Previous studies of two Convolutional Neural Network (CNN) architectures are included for comparison with new evaluation of Vision Transformer (ViT) architectures for semantic segmentation. Our results demonstrate viability of ViT architectures for off-road perception systems, having a strong segmentation accuracy, lower inference speed and memory footprint compared to previous results with CNN architectures.展开更多
针对主流Transformer网络仅对输入像素块做自注意力计算而忽略了不同像素块间的信息交互,以及输入尺度单一导致局部特征细节模糊的问题,本文提出一种基于Transformer并用于处理视觉任务的主干网络ConvFormer. ConvFormer通过所设计的多...针对主流Transformer网络仅对输入像素块做自注意力计算而忽略了不同像素块间的信息交互,以及输入尺度单一导致局部特征细节模糊的问题,本文提出一种基于Transformer并用于处理视觉任务的主干网络ConvFormer. ConvFormer通过所设计的多尺度混洗自注意力模块(Channel-Shuffle and Multi-Scale attention,CSMS)和动态相对位置编码模块(Dynamic Relative Position Coding,DRPC)来聚合多尺度像素块间的语义信息,并在前馈网络中引入深度卷积提高网络的局部建模能力.在公开数据集ImageNet-1K,COCO 2017和ADE20K上分别进行图像分类、目标检测和语义分割实验,ConvFormer-Tiny与不同视觉任务中同量级最优网络RetNetY-4G,Swin-Tiny和ResNet50对比,精度分别提高0.3%,1.4%和0.5%.展开更多
基金supported in part by the National Natural Science Foundation of China under Grants 61502162,61702175,and 61772184in part by the Fund of the State Key Laboratory of Geo-information Engineering under Grant SKLGIE2016-M-4-2+4 种基金in part by the Hunan Natural Science Foundation of China under Grant 2018JJ2059in part by the Key R&D Project of Hunan Province of China under Grant 2018GK2014in part by the Open Fund of the State Key Laboratory of Integrated Services Networks under Grant ISN17-14Chinese Scholarship Council(CSC)through College of Computer Science and Electronic Engineering,Changsha,410082Hunan University with Grant CSC No.2018GXZ020784.
文摘Transformer models have emerged as dominant networks for various tasks in computer vision compared to Convolutional Neural Networks(CNNs).The transformers demonstrate the ability to model long-range dependencies by utilizing a self-attention mechanism.This study aims to provide a comprehensive survey of recent transformerbased approaches in image and video applications,as well as diffusion models.We begin by discussing existing surveys of vision transformers and comparing them to this work.Then,we review the main components of a vanilla transformer network,including the self-attention mechanism,feed-forward network,position encoding,etc.In the main part of this survey,we review recent transformer-based models in three categories:Transformer for downstream tasks,Vision Transformer for Generation,and Vision Transformer for Segmentation.We also provide a comprehensive overview of recent transformer models for video tasks and diffusion models.We compare the performance of various hierarchical transformer networks for multiple tasks on popular benchmark datasets.Finally,we explore some future research directions to further improve the field.
文摘The development of autonomous vehicles has become one of the greatest research endeavors in recent years. These vehicles rely on many complex systems working in tandem to make decisions. For practical use and safety reasons, these systems must not only be accurate, but also quickly detect changes in the surrounding environment. In autonomous vehicle research, the environment perception system is one of the key components of development. Environment perception systems allow the vehicle to understand its surroundings. This is done by using cameras, light detection and ranging (LiDAR), with other sensor systems and modalities. Deep learning computer vision algorithms have been shown to be the strongest tool for translating camera data into accurate and safe traversability decisions regarding the environment surrounding a vehicle. In order for a vehicle to safely traverse an area in real time, these computer vision algorithms must be accurate and have low latency. While much research has studied autonomous driving for traversing well-structured urban environments, limited research exists evaluating perception system improvements in off-road settings. This research aims to investigate the adaptability of several existing deep-learning architectures for semantic segmentation in off-road environments. Previous studies of two Convolutional Neural Network (CNN) architectures are included for comparison with new evaluation of Vision Transformer (ViT) architectures for semantic segmentation. Our results demonstrate viability of ViT architectures for off-road perception systems, having a strong segmentation accuracy, lower inference speed and memory footprint compared to previous results with CNN architectures.
文摘针对主流Transformer网络仅对输入像素块做自注意力计算而忽略了不同像素块间的信息交互,以及输入尺度单一导致局部特征细节模糊的问题,本文提出一种基于Transformer并用于处理视觉任务的主干网络ConvFormer. ConvFormer通过所设计的多尺度混洗自注意力模块(Channel-Shuffle and Multi-Scale attention,CSMS)和动态相对位置编码模块(Dynamic Relative Position Coding,DRPC)来聚合多尺度像素块间的语义信息,并在前馈网络中引入深度卷积提高网络的局部建模能力.在公开数据集ImageNet-1K,COCO 2017和ADE20K上分别进行图像分类、目标检测和语义分割实验,ConvFormer-Tiny与不同视觉任务中同量级最优网络RetNetY-4G,Swin-Tiny和ResNet50对比,精度分别提高0.3%,1.4%和0.5%.