Video Description Generation Method Based on Latent Feature Augmented Network
Abstract: Video description generation aims to describe the objects in a video and their interactions in natural language. Existing methods do not fully exploit the spatio-temporal semantic information in videos, which limits their ability to generate accurate descriptions. To this end, a Latent Feature Augmented Network (LFAN) model is proposed for video description generation. Different feature extractors are used to obtain appearance, motion, and object features; the object-level features are fused with the frame-level appearance and motion features, and the fused features are then further enhanced. Before generating a description, a graph neural network and a Long Short-Term Memory (LSTM) network are used to infer the spatio-temporal relationships between objects, yielding latent features that carry both spatio-temporal and semantic information. Finally, a decoder combining an LSTM network and a Gated Recurrent Unit (GRU) generates the description sentence for the video. This model can accurately learn object features, guiding the generation of more accurate words and object relationships. Experimental results on the MSVD and MSR-VTT datasets show that LFAN significantly improves the accuracy of the generated descriptions and exhibits better semantic consistency with the video content, achieving BLEU@4 and ROUGE-L scores of 57.0 and 74.1 on MSVD, and 43.8 and 62.1 on MSR-VTT, respectively.
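The encoding pipeline described in the abstract (object-level features fused with frame-level appearance and motion features, followed by graph-based relational reasoning) can be illustrated with a minimal NumPy sketch. All tensor sizes, the additive fusion, and the similarity-based adjacency below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, D = 8, 5, 16  # frames, objects per frame, feature dim (illustrative sizes)

appearance = rng.standard_normal((T, D))     # frame-level appearance features (e.g. from a 2D CNN)
motion     = rng.standard_normal((T, D))     # frame-level motion features (e.g. from a 3D CNN)
objects    = rng.standard_normal((T, N, D))  # object-level features (e.g. from a detector)

# Fuse each frame's object features with that frame's appearance and motion
# features (simple additive fusion; the paper's exact fusion operator may differ).
obj_app = objects + appearance[:, None, :]
obj_mot = objects + motion[:, None, :]
fused = np.concatenate([obj_app, obj_mot], axis=-1)  # (T, N, 2*D)

def message_pass(x):
    """One round of graph message passing over the objects in a single frame:
    build a similarity-based adjacency, row-normalize it with a softmax,
    and aggregate each node's neighbors."""
    sim = x @ x.T                                    # (N, N) pairwise similarity
    adj = np.exp(sim - sim.max(axis=1, keepdims=True))
    adj /= adj.sum(axis=1, keepdims=True)            # row-normalized adjacency
    return adj @ x                                   # aggregated node features

# Per-frame relational reasoning; temporal reasoning over these latent
# features would then be handled by an LSTM in the described model.
latent = np.stack([message_pass(fused[t]) for t in range(T)])
print(latent.shape)  # (8, 5, 32)
```

The resulting latent features combine object, appearance, and motion cues with per-frame relational structure, which is the input the abstract's LSTM/GRU decoder would consume.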
Authors: LI Weijian; HU Huijun (School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China)
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journal list; 2024, No. 2, pp. 266-272 (7 pages)
Funding: National Natural Science Foundation of China (62271359).
Keywords: video description generation; latent feature augmented network; spatio-temporal semantic information; graph neural networks; feature fusion