Spatiotemporal Convolutional Network Design Based on Vision Transformer

Abstract: Most mainstream human action recognition methods are built on Convolutional Neural Networks (CNNs). CNNs tend to ignore spatial position information in video, which weakens action recognition in the spatial frequency domain of the video. Traditional CNNs also cannot quickly locate key feature positions, and they cannot be computed in parallel during training, which makes training inefficient. To address the problems traditional CNNs have with the time-frequency domain and with parallel computation, a human action recognition algorithm is proposed that combines the Vision Transformer (ViT) with Learning Spatiotemporal Features with 3D Convolutional Network (C3D). First, C3D extracts multi-dimensional feature maps from the video. Next, the ViT feature slicing window performs global feature segmentation of these multi-dimensional features. Finally, the Transformer encoder-decoder module predicts the human actions in the video. Experimental results show that the proposed algorithm improves action recognition accuracy on the UCF-101 and HMDB51 datasets.
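The slicing and attention steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, patch size, or training details, so every name and number below (`make_feature_map`, `to_patch_tokens`, `self_attention`, the 4x2x4x4 feature map, the patch size of 2) is a hypothetical choice made for illustration. A random tensor stands in for the C3D output, and the attention uses identity projections in place of learned query, key, and value weights; positional encoding, the decoder, and the classification head are omitted.

```python
import math
import random


def make_feature_map(channels, frames, height, width, seed=0):
    """Random stand-in for a C3D output, indexed as fmap[c][t][y][x]."""
    rng = random.Random(seed)
    return [[[[rng.uniform(-1.0, 1.0) for _ in range(width)]
              for _ in range(height)]
             for _ in range(frames)]
            for _ in range(channels)]


def to_patch_tokens(fmap, patch):
    """Cut every frame into non-overlapping patch x patch windows (ViT style)
    and flatten each window, across all channels, into one token vector."""
    channels, frames = len(fmap), len(fmap[0])
    height, width = len(fmap[0][0]), len(fmap[0][0][0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    tokens = []
    for t in range(frames):
        for y0 in range(0, height, patch):
            for x0 in range(0, width, patch):
                tokens.append([fmap[c][t][y][x]
                               for c in range(channels)
                               for y in range(y0, y0 + patch)
                               for x in range(x0, x0 + patch)])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q = K = V = tokens (identity projections); a trained
    Transformer would apply learned linear maps first.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale
                        for k in tokens])
               for q in tokens]
    outputs = [[sum(w * v[d] for w, v in zip(row, tokens))
                for d in range(dim)]
               for row in weights]
    return outputs, weights


# Hypothetical sizes: 4 channels, 2 time steps, 4x4 spatial grid.
fmap = make_feature_map(channels=4, frames=2, height=4, width=4)
tokens = to_patch_tokens(fmap, patch=2)
attended, weights = self_attention(tokens)
print(len(tokens), len(tokens[0]))  # 8 tokens, each of length 16
```

Because every token attends to every other token, a patch in one frame can draw on patches from any other frame or position, and the rows of the attention matrix are independent of one another, so they can in principle be computed in parallel.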

Authors: XIE Yinghong; HAO Yan; HAN Xiaowei; GAO Qiang; YIN Biao; WANG Zhaohui (School of Information Engineering, Shenyang University, Shenyang 110044, China)

Affiliation: School of Information Engineering, Shenyang University

Source: Computer & Network (《计算机与网络》), 2024, No. 4, pp. 283-288 (6 pages)

Keywords: action recognition; Vision Transformer (ViT); Convolutional Neural Network (CNN); Learning Spatiotemporal Features with 3D Convolutional Network (C3D); attention mechanism

Classification code: TP751.1 [Automation and Computer Technology: Detection Technology and Automatic Equipment]
