TEAM: Transformer Encoder Attention Module for Video Classification

Abstract: Much like humans focus solely on object movement to understand actions, directing a deep learning model's attention to the core contexts within videos is crucial for improving video comprehension. In a recent study, the Video Masked Auto-Encoder (VideoMAE) employs a pre-training approach with a high ratio of tube masking and reconstruction, effectively mitigating the spatial bias caused by temporal redundancy in full video frames. This steers the model's focus toward detailed temporal contexts. However, because VideoMAE still relies on full video frames during the action recognition stage, its attention may progressively shift towards spatial contexts, degrading its ability to capture the main spatio-temporal contexts. To address this issue, we propose an attention-directing module named Transformer Encoder Attention Module (TEAM). The proposed module directs the model's attention to the core characteristics within each video, inherently mitigating spatial bias. TEAM first identifies the core features among the overall features extracted from each video. It then discerns the specific parts of the video where those features are located, encouraging the model to focus more on these informative parts. Consequently, during the action recognition stage, TEAM shifts VideoMAE's attention from spatial contexts towards the core spatio-temporal contexts. This attention shift alleviates the spatial bias in the model and simultaneously enhances its ability to capture precise video contexts. We conduct extensive experiments to explore the optimal configuration that enables TEAM to fulfill its intended design purpose and facilitates its seamless integration with the VideoMAE framework. The integrated model, i.e., VideoMAE+TEAM, outperforms the existing VideoMAE by a significant margin on Something-Something-V2 (71.3% vs. 70.3%). Moreover, the qualitative comparisons demonstrate that TEAM encourages the model to disregard insignificant features and focus more on the essential video features, capturing more detailed spatio-temporal contexts within the video.
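
The two-step idea in the abstract (first weight the core features, then weight the parts of the video that carry them) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the layers, so the parameter-free pooling and sigmoid gates below, and every function name, are assumptions chosen only to make the order of operations concrete. A trained module would use learned layers where this sketch uses plain averages.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_gate(tokens):
    """Step 1 (assumed form): one weight per feature channel.

    Each channel is summarized by its mean over all tokens, then squashed
    with a sigmoid. A trained module would put learned layers here.
    """
    n_tokens = len(tokens)
    n_channels = len(tokens[0])
    means = [sum(tok[c] for tok in tokens) / n_tokens for c in range(n_channels)]
    return [sigmoid(m) for m in means]


def token_gate(tokens):
    """Step 2 (assumed form): one weight per spatio-temporal token.

    Each token is summarized by the mean of its (already reweighted)
    channels, then squashed with a sigmoid.
    """
    return [sigmoid(sum(tok) / len(tok)) for tok in tokens]


def two_stage_attention(tokens):
    """Reweight features first, then the tokens that carry them."""
    channel_w = feature_gate(tokens)
    refined = [[v * w for v, w in zip(tok, channel_w)] for tok in tokens]
    token_w = token_gate(refined)
    output = [[v * w for v in tok] for tok, w in zip(refined, token_w)]
    return output, channel_w, token_w


# Toy input: 4 tokens with 3 channels each. The token at index 2 has a
# strong response in channel 0, standing in for an informative video patch.
tokens = [
    [0.1, 0.0, -0.2],
    [0.2, 0.1, -0.1],
    [3.0, 0.2, 0.0],
    [0.0, -0.1, -0.3],
]
output, channel_w, token_w = two_stage_attention(tokens)
```

In this toy input, channel 0 receives the largest feature weight and the token at index 2 receives the largest token weight, so the output emphasizes the one patch that carries the dominant feature.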

Authors: Hae Sung Park, Yong Suk Choi

Affiliations: Department of Artificial Intelligence; Department of Computer Science

Source: Computer Systems Science & Engineering (English edition), 2024, Issue 2, pp. 451-477 (27 pages)

Funding: This work was supported by the National Research Foundation of Korea (NRF) Grant (Nos. 2018R1A5A7059549, 2020R1A2C1014037) and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant (No. 2020-0-01373), funded by the Korea government (*MSIT). *Ministry of Science and Information & Communication Technology.

Keywords: video classification; action recognition; vision transformer; masked auto-encoder

Classification code: TN762 [Electronics and Telecommunications: Circuits and Systems]
