Exploiting image captions and external knowledge as representation enhancement for VQA

Cited by: 4
Abstract: Visual question answering (VQA) is a multimodal task that requires a deep understanding of both the image and the textual question in order to reason out an answer. In many cases, however, simple reasoning over the image and the question alone does not yield the correct answer, even though other useful information, such as image captions and external knowledge, is available. This paper proposes a VQA model that uses image captions and external knowledge to enhance its representations. Guided by the question, the model encodes the image and its caption separately with a co-attention mechanism, and it incorporates external knowledge by using knowledge graph embeddings to initialize the word embeddings. Together these enrich the model's feature representations and strengthen its reasoning ability. Experiments on the OKVQA dataset show that the method improves accuracy by 1.71% over the baseline and by 1.88% over the best previously reported systems, which demonstrates its effectiveness.
Authors: WANG Yichao (王屹超); ZHU Muhua (朱慕华); XU Chen (许晨); ZHANG Yan (张琰); WANG Huizhen (王会珍); ZHU Jingbo (朱靖波). Natural Language Processing Lab, School of Computer Science and Engineering, Northeastern University, Shenyang 110000, China
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2022, No. 5, pp. 900-907 (8 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
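Funding: National Natural Science Foundation of China, Key Program (61732005); National Natural Science Foundation of China, General Program (61876035).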
Keywords: visual question answering; multimodal fusion; knowledge graph; image captioning
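
The two ideas named in the abstract, question-guided attention over the image and its caption and word embeddings initialized from knowledge graph embeddings, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attention here is a simplified one-directional, dot-product stand-in for the co-attention mechanism, and every name, dimension, and vector below (`question_guided_attention`, `init_word_embeddings`, the toy vocabulary, the random features) is a hypothetical placeholder chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def question_guided_attention(question_vec, features):
    """Attend over `features` (n, d) with `question_vec` (d,) as the query.

    Returns the attention-weighted summary (d,) and the weights (n,).
    """
    scores = features @ question_vec / np.sqrt(features.shape[1])
    weights = softmax(scores)
    return weights @ features, weights


def init_word_embeddings(vocab, kg_vectors, dim, rng):
    """Random init, then overwrite rows for words that have a KG vector."""
    table = rng.normal(scale=0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in kg_vectors:
            table[i] = kg_vectors[word]
    return table


rng = np.random.default_rng(0)
dim = 8

# Toy stand-ins: a real system would load pretrained KG embeddings and
# extracted image-region / caption-token features instead of random data.
vocab = ["what", "animal", "is", "this", "dog"]
kg_vectors = {"animal": rng.normal(size=dim), "dog": rng.normal(size=dim)}
emb = init_word_embeddings(vocab, kg_vectors, dim, rng)

question_vec = emb.mean(axis=0)              # toy question encoding
image_regions = rng.normal(size=(36, dim))   # assumed 36 region features
caption_tokens = rng.normal(size=(12, dim))  # assumed 12 caption tokens

# The question guides attention over the image and the caption separately.
img_ctx, img_w = question_guided_attention(question_vec, image_regions)
cap_ctx, cap_w = question_guided_attention(question_vec, caption_tokens)

# Fuse the three views into one vector that an answer classifier could use.
fused = np.concatenate([question_vec, img_ctx, cap_ctx])
print(fused.shape)
```

In this sketch the fused vector simply concatenates the question encoding with the two attended summaries; the fusion and answer prediction used in the paper are not described in the abstract and are not reproduced here.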