Abstract
Neural network models that jointly use the language and image modalities have made great progress in computer vision. Existing work applying them to video recognition tasks has shortcomings such as neglecting the rich spatio-temporal information in videos and using overly simple text to describe action categories. To address this, this paper proposes a language-video contrastive learning model supervised by spatio-temporal auxiliary information. For video encoding, a category-token-based temporal weighted shift module is proposed for temporal modeling, allowing temporal information to propagate through all levels of the network from bottom to top; a spatio-temporal information auxiliary supervision module is also proposed to deeply mine the rich spatio-temporal information contained in visual tokens. For language encoding, a prompt learning method based on large language models is proposed to expand the text descriptions of action categories and generate descriptions with rich contextual semantic information. In experiments on four video action recognition datasets, mini-Kinetics-200, Kinetics-400, UCF101, and HMDB51, the proposed model achieves recognition accuracy better than or comparable to the current state-of-the-art methods, improving over the baseline by 2.5%, 0.3%, 0.6%, and 2.4%, respectively.
Video action recognition is one of the hot topics in computer vision and has attracted the attention of many researchers in recent decades. Video action recognition methods are widely used in Internet video auditing, video surveillance, human-computer interaction, and other fields. The main subject of a video is usually a person. Because human action categories and environments in real life are complex and variable, and the volume of video data is huge, the task demands substantial computing resources, which poses great challenges for video action recognition. In video surveillance, most existing systems only record abnormal actions and cannot recognize them in real time, so they fall short of true intelligence; in Internet video auditing, a large amount of manual review is often needed because human actions cannot be recognized in real time. A video can usually be regarded as a sequence of images changing over time, and this special image data contains rich information. To recognize actions from video, it is necessary not only to obtain the spatial information of the image at each moment, but also to capture the temporal reasoning information between frames and, more importantly, to obtain joint spatio-temporal information. To this end, researchers have developed many network architectures for video action recognition, which can be divided into four categories: methods based on two-stream convolutional neural networks (CNNs), methods based on 3D CNNs, 2D convolutional networks with spatio-temporal modeling modules, and visual Transformer-based networks. Transformer-based models that integrate both the language and image modalities have made great progress in computer vision. Three representative works on image-related tasks are the Contrastive Language-Image Pre-training (CLIP) model, the A Large-scale Image and Noisy-text embedding (ALIGN) model, and the Florence model. However, when these models are applied to video recognition tasks, some limitations remain, such as the lack of consideration of the rich spatio-temporal information in videos and the simplicity of the text descriptions used for video categories, which results in insufficient contextual description ability. In this paper, we propose a language-video contrastive learning model based on spatio-temporal auxiliary information supervision. For the video encoder, we propose a category-token-based temporal weighted displacement module for temporal modeling, which enables temporal information to propagate through all levels of the network from bottom to top. Furthermore, we propose a spatio-temporal information auxiliary supervision module to deeply explore the rich spatio-temporal information embedded in visual tokens. For the language encoder, we propose a prompt learning method based on large-scale pre-trained language models to extend action category text descriptions and generate text descriptions with rich contextual semantic information. In experiments on four video action recognition datasets, namely mini-Kinetics-200, Kinetics-400, UCF101, and HMDB51, the proposed model performs better than or comparably to the current state-of-the-art methods, and its accuracy is 2.5%, 0.3%, 0.6%, and 2.4% higher than the baseline, respectively.
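The two ingredients the abstract describes, zero-parameter channel shifting across frames for temporal modeling and CLIP-style contrastive matching between a video feature and per-class text embeddings, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the shift fraction, and the temperature value are illustrative assumptions, and real encoders would replace the random features used here.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def temporal_shift(tokens, shift_frac=0.25):
    """Mix per-frame class-token features with their temporal neighbours.

    tokens: (T, D) array, one feature vector per frame. A fraction of the
    channels is shifted forward one frame and another fraction backward
    (the zero-parameter shift idea behind TSM-style temporal modeling);
    the rest stay in place. shift_frac=0.25 is an illustrative choice.
    """
    T, D = tokens.shape
    k = int(D * shift_frac)
    out = np.zeros_like(tokens)
    out[1:, :k] = tokens[:-1, :k]          # channels shifted forward in time
    out[:-1, k:2 * k] = tokens[1:, k:2 * k]  # channels shifted backward
    out[:, 2 * k:] = tokens[:, 2 * k:]     # remaining channels unchanged
    return out

def classify(video_tokens, text_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: cosine similarity between the pooled
    video feature and each class-prompt embedding, softmaxed into
    per-class probabilities.

    video_tokens: (T, D) frame features; text_embs: (C, D) class embeddings.
    """
    v = l2_normalize(temporal_shift(video_tokens).mean(axis=0))
    t = l2_normalize(text_embs)
    logits = t @ v / temperature
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()
```

In the full model, `video_tokens` would come from a Transformer video encoder and `text_embs` from encoding LLM-expanded category prompts; the sketch only shows how the two modalities are compared.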
Authors
张冰冰
张建新
李培华
ZHANG Bing-Bing; ZHANG Jian-Xin; LI Pei-Hua (School of Computer Science and Engineering, Dalian Minzu University, Dalian, Liaoning 116650; School of Information and Communication Engineering, Dalian University of Technology, Dalian, Liaoning 116033)
Source
《计算机学报》
EI
CAS
CSCD
Peking University Core Journal Index (北大核心)
2024, Issue 8, pp. 1769-1785 (17 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China (61971086, 61972062)
Science and Technology Development Plan Project of the Jilin Provincial Department of Science and Technology (20230201111GX)
Applied Basic Research Program of Liaoning Province (2023JH2/101300191, 2023JH2/101300193).
Keywords
action recognition
multimodal model
temporal modeling
spatio-temporal information auxiliary supervision
prompt learning