Video Captioning Method Fusing Semantic Information and Visual Reasoning Features
Abstract: Video captioning is a cross-modal task spanning computer vision and natural language processing. Its goal is to automatically generate a description of a video that covers the main content accurately and completely and is also grammatically well formed. Existing video captioning methods still fall short in the interpretability of the generation process and the accuracy of the generated content. To address this, the paper proposes a video captioning method, based on the encoder-decoder framework, that fuses semantic information with visual reasoning features. The method modifies the decoding stage and introduces three feature fusion networks: a feature-involved fusion network, a feature-guided fusion network, and a weighted fusion network. These networks fuse the semantic features of a video with its visual reasoning features to produce descriptions that are both interpretable and accurate. Ablation and comparison experiments on the MSVD and MSRVTT datasets show that, compared with the base model, the proposed method improves the CIDEr score by 21.6% and 3.5%, respectively. Comparison with other methods shows that the proposed method is reasonably competitive on each metric.
Authors: ZHANG Haomeng (张浩萌); LIU Bin (刘斌) (College of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2024, No. 2, pp. 470-476 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (61672279).
Keywords: video captioning; feature fusion; visual reasoning feature; semantic feature
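
The abstract names three ways of fusing a semantic feature with a visual reasoning feature but gives no architectural detail. The sketch below only illustrates three generic fusion patterns that the names loosely suggest (concatenation, gating, and a weighted sum) on plain Python lists. The function names, the use of a sigmoid gate, and the fixed scalar weight are all assumptions made for illustration, and none of this should be read as the networks described in the paper.

```python
import math


def _check_same_length(semantic, visual):
    # Element-wise fusion only makes sense for vectors of equal size.
    if len(semantic) != len(visual):
        raise ValueError("semantic and visual vectors must have equal length")


def fuse_by_concatenation(semantic, visual):
    """Hypothetical 'feature-involved' style: both vectors are passed on side by side."""
    return list(semantic) + list(visual)


def fuse_by_gating(semantic, visual):
    """Hypothetical 'feature-guided' style: a sigmoid of the semantic vector
    scales each visual component (an assumed gating scheme)."""
    _check_same_length(semantic, visual)
    gates = [1.0 / (1.0 + math.exp(-s)) for s in semantic]
    return [g * v for g, v in zip(gates, visual)]


def fuse_by_weighting(semantic, visual, weight):
    """Hypothetical 'weighted' style: convex combination
    weight * semantic + (1 - weight) * visual with a fixed scalar weight."""
    _check_same_length(semantic, visual)
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * s + (1.0 - weight) * v for s, v in zip(semantic, visual)]
```

In a trained model the gate and the weight would normally be learned rather than fixed; the fixed values here keep the example self-contained.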