
Visual Question Answering Model Based on Multi-modal Deep Feature Fusion (cited by: 3)
Abstract: In the era of big data, the explosive growth of multi-source heterogeneous data has drawn considerable research attention to multi-modal data fusion. Visual question answering (VQA), which requires joint processing of images and text, has become a focal topic in this area. A VQA task associates and fuses features from the image and text modalities, then reasons over the fused representation to produce an answer. Traditional VQA models tend to lose key modal information during feature fusion, and most methods stop at shallow feature-association representation learning, giving little consideration to deep semantic feature fusion. To address these problems, this paper proposes a VQA model based on cross-modal deep interaction of image and text features. The model uses a convolutional neural network and a long short-term memory (LSTM) network to extract image and text features respectively. It then applies a novel deep attention learning network, built from combinations of meta-attention units, to learn attention features interactively both within and between the image and text modalities. Finally, the learned features are fused into a multi-modal representation that is used for inference and prediction. The model is tested and evaluated on the VQA-v2.0 dataset, and the results show that its performance improves markedly over the baseline models.
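The attention pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define the meta-attention unit, so everything here is an assumption: plain scaled dot-product attention without learned projections stands in for the unit, the function names (`self_attention`, `guided_attention`, `fuse`), the `depth` parameter, and the mean-pool-and-sum fusion are hypothetical, and the inputs are taken to be feature vectors already extracted by the CNN and LSTM. It is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def self_attention(x):
    """Intra-modal unit (assumed form): a modality attends to itself."""
    return attention(x, x, x)


def guided_attention(x, y):
    """Inter-modal unit (assumed form): x attends to the other modality y."""
    return attention(x, y, y)


def mean_pool(x):
    """Average a list of vectors into a single vector."""
    n = len(x)
    return [sum(v[j] for v in x) / n for j in range(len(x[0]))]


def fuse(image_feats, question_feats, depth=2):
    """Stack the units `depth` times, then fuse by pooling and summing.

    The stacking order and the fusion rule are illustrative assumptions.
    """
    q, v = question_feats, image_feats
    for _ in range(depth):
        q = self_attention(q)        # attention within the text modality
        v = self_attention(v)        # attention within the image modality
        v = guided_attention(v, q)   # image regions attend to question tokens
    return [a + b for a, b in zip(mean_pool(q), mean_pool(v))]
```

In a full model the fused vector would feed a classifier over candidate answers, and each unit would carry learned query, key, and value projections; both are omitted here to keep the sketch dependency-free.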
Authors: ZOU Yunzhu; DU Shengdong; TENG Fei; LI Tianrui (Institute of Computer and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 2, pp. 123-129 (7 pages)
Funding: National Science and Technology Major Project (2020AAA0105101).
Keywords: Visual question answering; Multi-modal feature fusion; Attention mechanism; Deep learning; Data fusion