
Video Question Answering Scheme Based on Multimodal Knowledge Active Learning

Original title: 基于多模态知识主动学习的视频问答方案
Abstract: Video question answering requires models to understand, fuse, and reason about the multimodal data in videos, helping people quickly retrieve, analyze, and summarize complex scenes. It has become a hot research topic in artificial intelligence. However, existing methods fail to capture the motion details of visual objects during feature extraction, which can lead to false causal relationships. In addition, during data fusion and reasoning, existing methods lack an effective active learning ability, so they struggle to acquire prior knowledge beyond what feature extraction provides, which limits the model's deep understanding of multimodal content. To address these issues, we first design an explicit multimodal feature extraction module that establishes the motion trajectory of each visual target by capturing the semantic correlations among visual targets in the image sequence and their dynamic relationships with the surrounding environment. Supplementing static content with this dynamic content provides a more accurate video feature representation for data fusion and reasoning. Second, we propose a knowledge self-enhanced multimodal data fusion and reasoning model, which achieves self-improvement and logical focus in multimodal information understanding, deepens the understanding of multimodal features, and reduces dependence on prior knowledge. Finally, we propose a video question answering scheme based on multimodal knowledge active learning. Experimental results show that the scheme outperforms existing state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments further verify its rationality.
Authors: Liu Mingyang (刘明阳); Wang Ruomei (王若梅); Zhou Fan (周凡); Lin Ge (林格). National Engineering Research Center of Digital Life, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2024, No. 4, pp. 889-902 (14 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National Key Research and Development Program of China (2021YFF0900900).
Keywords: video question answering; data fusion and reasoning; multimodal active learning; video details description extraction; deep learning