Abstract
As a multi-modal task, the bottleneck of visual question answering (VQA) lies in fusing the different modalities: it requires not only a full understanding of the visual and textual content in the image, but also the ability to align cross-modal representations. The introduction of the attention mechanism provides an effective path for multi-modal fusion. However, previous methods usually apply attention directly to the extracted image features, ignoring the noise and incorrect information those features contain, and most methods are limited to shallow interaction between modalities without considering deeper cross-modal semantic information. To address this problem, a cross-modal information filtering network (CIFN) is proposed. First, the question features are taken as a supervision signal, and a dedicated information filtering module filters the image features so that they better fit the question representation. The image features and question features are then fed into a cross-modal interaction layer, where self-attention and guided attention model the intra-modal and inter-modal relationships respectively, yielding more fine-grained multi-modal features. Extensive experiments are conducted on the VQA 2.0 dataset. The results show that the information filtering module effectively improves model accuracy, reaching an overall accuracy of 71.51% on test-std, which compares favorably with most state-of-the-art methods.
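To make the two components described above more concrete, the following is a minimal PyTorch-style sketch of a question-guided filtering step followed by guided attention. The paper's exact formulation is not given in this record, so the module names, the sigmoid gating, the mean-pooling of the question, and all dimensions are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class InformationFilter(nn.Module):
        # Hypothetical question-guided filter: the pooled question vector acts as a
        # supervision signal that gates (suppresses) noisy image-region features.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, img_feats, q_vec):
            # img_feats: (B, R, D) region features; q_vec: (B, D) pooled question
            q_exp = q_vec.unsqueeze(1).expand_as(img_feats)
            g = torch.sigmoid(self.gate(torch.cat([img_feats, q_exp], dim=-1)))
            return g * img_feats  # element-wise gating keeps question-relevant content

    class GuidedAttention(nn.Module):
        # Question-guided (inter-modal) attention over the filtered image features;
        # the same layer with img_feats as keys/values would act as self-attention.
        def __init__(self, dim, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_feats, q_feats):
            out, _ = self.attn(img_feats, q_feats, q_feats)  # queries: image; keys/values: question
            return out

    # Toy usage with assumed sizes: 36 regions, 14 question tokens, 512-d features
    B, R, T, D = 2, 36, 14, 512
    img = torch.randn(B, R, D)
    q_tokens = torch.randn(B, T, D)
    q_vec = q_tokens.mean(dim=1)                    # pooled question representation
    filtered = InformationFilter(D)(img, q_vec)     # information filtering stage
    fused = GuidedAttention(D)(filtered, q_tokens)  # cross-modal interaction stage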
Authors
HE Shiyang
WANG Zhaohui
GONG Shengrong
ZHONG Shan
(School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215008, China; Soochow College, Soochow University, Suzhou, Jiangsu 215006, China; School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, Jiangsu 215500, China)
Source
《计算机科学》
CSCD
Peking University Core Journal
2024, No. 5, pp. 85-91 (7 pages)
Computer Science
Funding
National Natural Science Foundation of China (61972059, 42071438)
Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475)
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172021K01).
Keywords
Visual question answering
Deep learning
Attention mechanism
Multi-modal fusion
Information filtering