Abstract
Due to its quadratic patch complexity and weaker local inductive bias, the Vision Transformer requires large amounts of data, more sophisticated data-augmentation strategies, and additional training tricks to surpass efficient convolutional networks. To address these issues, this paper studies multi-scale feature extraction and image frequency, and proposes FMSWFormer, a model with a lightweight attention mechanism. FMSWFormer employs hybrid convolution/self-attention modules to establish communication between different frequencies, and implements local attention through window partitioning to constrain excessive computational cost. Drawing on adaptive-scale-aware convolution, it incorporates multi-scale operators into the self-attention computation, giving multi-head self-attention an adaptive scale-aware capability; in this way, FMSWFormer combines the strong local inductive bias of CNNs with the dynamic long-range dependency modeling of Transformers. Extensive experiments on various benchmark recognition datasets demonstrate the effectiveness of FMSWFormer, which achieves superior performance on multiple datasets without increasing time cost. Notably, on CIFAR100, FMSWFormer outperforms SepViT by 4.2% while reducing latency by 47.8%, and with 22% fewer parameters than EfficientNetV2 it still surpasses it by 3.94%.
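The abstract names two mechanisms: window partitioning to bound the attention cost, and multi-scale operators folded into the self-attention computation. Below is a minimal PyTorch sketch of how these two ideas can be combined; the module name MultiScaleWindowAttention, the scales tuple, and the placement of the depthwise convolutions before the attention are illustrative assumptions, not the paper's released FMSWFormer code.

```python
import torch
import torch.nn as nn


class MultiScaleWindowAttention(nn.Module):
    """Window-partitioned multi-head self-attention with a multi-scale
    depthwise-convolution operator on the attention input (a sketch of
    the two ideas named in the abstract, not the paper's exact design)."""

    def __init__(self, dim, num_heads=4, window=7, scales=(3, 5, 7)):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.win = num_heads, window
        self.scale = (dim // num_heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Parallel depthwise convolutions at several kernel sizes:
        # a cheap multi-scale operator mixed into the attention path.
        self.ms_convs = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in scales
        )

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by self.win
        B, C, H, W = x.shape
        w = self.win
        # Multi-scale mixing before attention (adaptive-scale perception).
        x = x + sum(conv(x) for conv in self.ms_convs)
        # Partition into non-overlapping windows -> (B*nWin, w*w, C),
        # so attention is quadratic only in the fixed window size.
        t = x.view(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        q, k, v = self.to_qkv(t).chunk(3, dim=-1)
        # Split heads: (B*nWin, heads, w*w, C//heads).
        shape = (t.size(0), w * w, self.heads, C // self.heads)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Merge windows back to the (B, C, H, W) feature map.
        out = out.view(B, H // w, W // w, w, w, C)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


# Smoke test: 28x28 features with 7x7 windows.
# x = torch.randn(2, 64, 28, 28)
# print(MultiScaleWindowAttention(64)(x).shape)  # torch.Size([2, 64, 28, 28])
```

Because attention is computed per window, cost grows linearly with image area for a fixed window size; the parallel depthwise convolutions add multi-scale context at negligible parameter cost, consistent with the abstract's latency and parameter-count claims.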
Authors
CAI Daili; XIE Weibo (College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China)
Source
Journal of Jimei University (Natural Science), 2023, No. 6, pp. 568-576 (9 pages)
Funding
Supported by the National Natural Science Foundation of China (61271383).
Keywords
window self-attention
multi-scale feature extraction
deep learning
convolutional neural networks
high- and low-frequency image decoupling