Abstract
Existing video captioning models generate caption text with poor readability and low accuracy. To alleviate these problems, this paper proposes a semantic-guided video captioning method based on the Vision Transformer (ViT). First, the visual features of the video are extracted with ResNeXt and the Efficient Convolutional Network (ECO). Second, a Semantic Detection Network (SDN) is trained, taking the extracted visual features as input and the predicted probabilities of the semantic labels as output. Third, the static and dynamic visual features are globally encoded by the ViT and fused with the semantic features extracted by the SDN through an attention mechanism. Finally, the fused features are decoded by a semantic Long Short-Term Memory (LSTM) network to generate the caption text of the video. Introducing the semantic features of the video guides the model toward captions that better match human expression habits and are therefore more readable. Test results on the MSR-VTT dataset show that the model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 44.8, 28.9, 62.8, and 51.1, respectively; compared with the mainstream video captioning models ADL and SBAT, the total improvement across the four metrics reaches 16.6 and 16.8 points.
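The abstract above only outlines the pipeline. Below is a minimal PyTorch-style sketch of how such a semantic-guided encoder-decoder could be wired together; it is an illustration under assumed dimensions, not the authors' implementation, and every module and parameter name here (SemanticGuidedCaptioner, feat_dim, num_tags, and so on) is hypothetical. A standard TransformerEncoder stands in for the ViT-style global encoder, and a single cross-attention layer stands in for the attention fusion step.

```python
# Minimal sketch of the described pipeline (assumptions: PyTorch, equal
# feature dimensions for both backbones, a TransformerEncoder in place of
# the ViT encoder). All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class SemanticGuidedCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, model_dim=512, num_tags=300,
                 vocab_size=10000, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)  # project CNN features
        enc_layer = nn.TransformerEncoderLayer(model_dim, num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)  # global encoding
        self.sdn = nn.Sequential(  # semantic detection: features -> tag probabilities
            nn.Linear(feat_dim, model_dim), nn.ReLU(),
            nn.Linear(model_dim, num_tags), nn.Sigmoid())
        self.tag_proj = nn.Linear(num_tags, model_dim)  # embed tag probabilities
        self.fuse = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.decoder = nn.LSTM(2 * model_dim, model_dim, batch_first=True)
        self.out = nn.Linear(model_dim, vocab_size)

    def forward(self, static_feats, dynamic_feats, tokens):
        # static_feats / dynamic_feats: (B, T, feat_dim); tokens: (B, L) word ids
        feats = torch.cat([static_feats, dynamic_feats], dim=1)
        enc = self.encoder(self.proj(feats))              # (B, 2T, model_dim)
        sem = self.tag_proj(self.sdn(feats.mean(dim=1)))  # pooled semantic vector
        # attention fusion: the semantic vector queries the encoded visual tokens
        fused, _ = self.fuse(sem.unsqueeze(1), enc, enc)  # (B, 1, model_dim)
        ctx = fused.expand(-1, tokens.size(1), -1)        # repeat context per step
        x = torch.cat([self.embed(tokens), ctx], dim=-1)  # word + fused context
        h, _ = self.decoder(x)                            # semantic-LSTM-style decoding
        return self.out(h)                                # (B, L, vocab_size) logits

model = SemanticGuidedCaptioner()
logits = model(torch.randn(2, 8, 2048),            # appearance features (e.g. ResNeXt)
               torch.randn(2, 8, 2048),            # motion features (e.g. ECO)
               torch.randint(0, 10000, (2, 12)))   # previously generated word ids
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In the actual model, the SDN would presumably be pre-trained with a multi-label objective on semantic tags mined from the reference captions, and inference would decode step by step with the previously generated word fed back in; the sketch shows only a single teacher-forced forward pass.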
Authors
ZHAO Hong; CHEN Zhiwen; GUO Lan; AN Dong (College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China)
Source
Computer Engineering (《计算机工程》), 2023, No. 5, pp. 247-254 (8 pages)
Indexed in CAS, CSCD, and the Peking University Core Journals list
Funding
National Natural Science Foundation of China, "Research on Broad-Spectrum Malicious Domain Name Detection Methods Based on Deep Learning" (62166025)
Key Research and Development Program of Gansu Province, "Surveillance Video Content Understanding and Caption Text Generation, with Demonstration Applications in Key Industries" (21YF5GA073)
Keywords
video content captioning
video understanding
Vision Transformer (ViT) model
semantic guidance
Long Short-Term Memory (LSTM) network
attention mechanism