Journal Article

Cross-modal Information Filtering-based Networks for Visual Question Answering
Abstract  As a multi-modal task, the bottleneck of visual question answering (VQA) lies in fusing information across modalities: the model must not only fully understand the visual and textual content of the image, but also be able to align cross-modal representations. The attention mechanism provides an effective path for multi-modal fusion. However, previous methods usually apply attention directly to the extracted image features, ignoring the noise and incorrect information those features contain, and most methods are limited to shallow interaction between modalities, without considering deep inter-modal semantic information. To address this problem, a cross-modal information filtering network (CIFN) is proposed. First, taking the question features as a supervision signal, a purpose-built information filtering module filters the image feature information so that it better fits the question representation. The image features and question features are then fed into a cross-modal interaction layer, where self-attention and guided attention model the intra-modal and inter-modal relationships respectively, yielding finer-grained multi-modal features. Extensive experiments on the VQA2.0 dataset show that the information filtering module effectively improves model accuracy: the overall accuracy on test-std reaches 71.51%, which is competitive with most state-of-the-art methods.
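The question-supervised filtering and guided-attention pipeline summarized in the abstract can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the gate weights `Wg`, the mean-pooled question summary, and the single-head scaled dot-product form of the guided attention are all illustrative choices not specified by the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8          # hidden size (toy)
m, n = 5, 4    # number of image regions, question tokens

V = rng.normal(size=(m, d))   # image region features
Q = rng.normal(size=(n, d))   # question word features

# --- Information filtering: a question summary gates each region feature,
#     suppressing channels that do not fit the question representation ---
q_bar = Q.mean(axis=0)                       # question summary vector (d,)
Wg = rng.normal(size=(2 * d, d)) * 0.1       # hypothetical gate weights
gate = sigmoid(np.concatenate(
    [V, np.tile(q_bar, (m, 1))], axis=1) @ Wg)   # (m, d), entries in (0, 1)
V_filtered = gate * V                        # filtered image features

# --- Guided attention: question tokens attend over the filtered regions,
#     modeling the inter-modal relationship ---
scores = Q @ V_filtered.T / np.sqrt(d)       # (n, m) similarity scores
attn = softmax(scores, axis=-1)              # rows sum to 1
fused = attn @ V_filtered                    # (n, d) question-aware visual features

print(V_filtered.shape, fused.shape)
```

In the full model the abstract also describes self-attention within each modality before this cross-modal step; the sketch keeps only the two components specific to CIFN, filtering and guidance.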
Authors  HE Shiyang (何世阳); WANG Zhaohui (王朝晖); GONG Shengrong (龚声蓉); ZHONG Shan (钟珊) (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215008, China; Soochow College, Soochow University, Suzhou, Jiangsu 215006, China; School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, Jiangsu 215500, China)
Source  Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2024, No. 5, pp. 85-91 (7 pages)
Funding  National Natural Science Foundation of China (61972059, 42071438); Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172021K01)
Keywords  Visual question answering; Deep learning; Attention mechanism; Multi-modal fusion; Information filtering