Abstract
In recent years, the Transformer has gradually become a mainstream architecture in computer vision. Its long-range modeling capability and high parallelism allow it to match the performance of convolutional neural networks (CNNs). However, applying the attention mechanism to computer vision still faces two main problems at the current stage: high computational complexity and the need for large amounts of training data. To address these issues, a category-query-based vision Transformer model (OB_ViT) is proposed. Its innovations lie in two aspects: the introduction of learnable category queries and a loss function based on the Hungarian algorithm. Specifically, learnable category queries are fed to the decoder as input, which allows the model to reason about the relationship between target categories and the global image context. In addition, the Hungarian algorithm is used to enforce unique predictions, ensuring that each category query learns exactly one target category. Image classification experiments on the CIFAR-10 and 5-class Flower datasets show that, compared with ViT and ResNet-50, the OB_ViT model achieves significantly higher accuracy with fewer parameters; on CIFAR-10, for example, the parameter count is reduced by 15% and accuracy improves by 22%.
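The abstract names two mechanisms: learnable category queries fed to a Transformer decoder, and Hungarian matching so that each query is assigned to at most one target category. The following is a minimal sketch of how such a head could be wired up, not the authors' released implementation; the module and parameter names (OBViTHead, num_queries, d_model) and the DETR-style "no object" slot are illustrative assumptions.

```python
# Hedged sketch of a category-query decoder head with Hungarian matching.
# Assumes image patch tokens from a ViT encoder; all names are illustrative.
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment


class OBViTHead(nn.Module):
    def __init__(self, num_classes=10, num_queries=10, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Learnable category queries: one embedding per query slot.
        self.query_embed = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        # Each decoded query predicts a class; the extra slot means "no category".
        self.class_head = nn.Linear(d_model, num_classes + 1)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, d_model) features from the ViT encoder.
        B = patch_tokens.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, patch_tokens)  # queries cross-attend to image context
        return self.class_head(decoded)                # (B, num_queries, num_classes + 1)


def hungarian_match(logits, target_classes):
    """Assign each target class to exactly one query via the Hungarian algorithm."""
    prob = logits.softmax(-1)                                 # (num_queries, num_classes + 1)
    # Cost = negative probability of each target class under each query.
    cost = -prob[:, target_classes].detach().cpu().numpy()    # (num_queries, num_targets)
    query_idx, target_idx = linear_sum_assignment(cost)
    return query_idx, target_idx
```

Under this reading, the matched query/target pairs would be trained with a standard cross-entropy loss, while unmatched queries are pushed toward the "no category" slot, which is what enforces the one-query-per-category behavior described in the abstract.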
Authors
JIANG Chunyu; WANG Wei (School of Economics and Management, Jilin Institute of Chemical Technology, Jilin City 132022, China)
Source
Journal of Jilin Institute of Chemical Technology (《吉林化工学院学报》)
CAS
2024, No. 3, pp. 62-67 (6 pages)