Video Captioning Method Fusing Semantic Information and Visual Reasoning Features (融合语义信息和视觉推理特征的视频描述方法)

Abstract: Video captioning is a cross-modal task spanning computer vision and natural language processing. Its goal is to automatically generate a description of a video that covers the main content accurately and completely while conforming to basic grammatical structure. To address the shortcomings of existing video captioning methods in the interpretability of the generation process and the accuracy of the generated content, this paper proposes a video captioning method, based on the encoder-decoder framework, that fuses semantic information with visual reasoning features. The method modifies the decoding stage and proposes three feature fusion networks (a feature-involved fusion network, a feature-guided fusion network, and a weighted fusion network) that fuse the semantic features of a video with its visual reasoning features, producing descriptions that are both interpretable and accurate. Ablation and comparison experiments on the MSVD and MSRVTT datasets show that, compared with the base model, the proposed method improves the CIDEr score by 21.6% and 3.5%, respectively. Comparison with other methods shows that the proposed method is reasonably competitive on every metric.

Authors: ZHANG Haomeng (张浩萌), LIU Bin (刘斌). College of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China

Affiliation: College of Computer Science and Technology, Nanjing Tech University (南京工业大学计算机科学与技术学院)

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 2, pp. 470-476 (7 pages)

Funding: Supported by the National Natural Science Foundation of China (61672279).

Keywords: video captioning; feature fusion; visual reasoning feature; semantic feature

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]

