Abstract
Deep-learning methods for video action recognition handle temporal information in two main ways. One uses optical flow to represent motion between adjacent frames, but it cannot effectively model long-range temporal features; the other uses 3D convolution to jointly model spatial and temporal signals, but it introduces a large number of parameters, sharply increasing memory consumption and computation. To address these issues, we propose an action recognition method that improves the spatial-temporal feature extraction of a 2D CNN by embedding a Spatial-temporal Gate and Motion Attention-aggregation (SGMA) module in the network. SGMA contains two submodules: the Spatial-temporal Dynamic Gate (SDG) and Motion Attention-aggregation (MAA). The SDG can visualize the motion proportion factor of each channel feature and, based on it, separate the channels into motion-relevant and motion-irrelevant features. MAA builds a pyramid structure on the motion-relevant features to extract motion features over different temporal spans, then adaptively aggregates the features of each span with an attention mechanism to achieve long-range temporal modeling. The motion-irrelevant features pass through a 2D convolution to extract spatial features, which are fused with the MAA output to produce a strong spatial-temporal feature representation. Under the same frame sampling strategy, our method improves Top-1 accuracy over the TSM baseline by 4.4% and 6.2% on the Something-Something V1 and V2 validation sets, respectively.
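The gating-then-aggregation pipeline described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the motion factor is approximated by mean absolute frame differences, the temporal pyramid by 1D smoothing at spans 1/2/4, the attention weights by a softmax over pooled span descriptors, and the spatial 2D-convolution branch by an identity mapping.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sgma_sketch(x):
    """Toy SGMA forward pass on a feature clip x of shape (T, C, H, W)."""
    T, C, H, W = x.shape

    # --- Spatial-temporal Dynamic Gate (SDG), simplified ---
    # Per-channel motion factor from mean absolute frame differences.
    diff = np.abs(np.diff(x, axis=0))                 # (T-1, C, H, W)
    motion = diff.mean(axis=(0, 2, 3))                # (C,) motion energy per channel
    gate = 1.0 / (1.0 + np.exp(-(motion - motion.mean())))  # sigmoid gate in (0, 1)
    strong = x * gate[None, :, None, None]            # motion-relevant part
    weak = x * (1.0 - gate)[None, :, None, None]      # motion-irrelevant part

    # --- Motion Attention-aggregation (MAA), simplified ---
    # Temporal pyramid: smooth the motion-relevant features at several spans.
    spans = [1, 2, 4]
    levels = []
    for s in spans:
        k = np.ones(s) / s
        pooled = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), 0, strong)
        levels.append(pooled)
    stack = np.stack(levels)                          # (S, T, C, H, W)

    # Attention over spans from globally pooled descriptors.
    desc = stack.mean(axis=(1, 3, 4))                 # (S, C)
    attn = softmax(desc.mean(axis=1))                 # (S,) weights over spans
    motion_feat = (attn[:, None, None, None, None] * stack).sum(axis=0)

    # --- Fusion: spatial branch (identity stands in for the 2D conv) + MAA ---
    return weak + motion_feat

x = np.random.rand(8, 16, 7, 7).astype(np.float32)   # 8 frames, 16 channels
y = sgma_sketch(x)                                    # same shape as the input
```

Because the gate and attention weights are derived from the input itself, the split into motion-relevant and motion-irrelevant channels adapts per clip, which mirrors the "dynamic" aspect of the SDG described in the abstract.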
Authors
吉晨钟
次旺晋美
张伟
陈云芳
JI Chenzhong; CI Wangjinmei; ZHANG Wei; CHEN Yunfang (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal (北大核心)
2024, No. 1, pp. 168-176 (9 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Key Research and Development Program of China (2019YFB2101700).
Keywords
video action recognition
spatial-temporal feature extraction
attention mechanism
long-term temporal modeling