Abstract
Visual question answering (VQA) is a multimodal task involving images and text: given an image and a question expressed in natural language, a VQA system must jointly reason over the visual and textual information to produce an accurate answer to the question about the image. Existing VQA models fail to exploit the multi-level features of text and image when locating the image regions relevant to the question. We therefore use a self-attention memory layer, so that each layer of the resulting features incorporates the prior knowledge accumulated in earlier layers. In addition, a cross-memory module feeds the weighted features of every encoder level into all guided-attention layers of the decoder; through guided attention, low-level and high-level information are fused, and this multi-level information lets the model focus more precisely on the key regions of the image features. We conduct comparative experiments on the VQA v2.0 dataset, which show that the model makes full use of the multi-level feature information of images and text and outperforms current mainstream models.
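To make the described architecture concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: an encoder whose self-attention layers carry a learned memory forward (so each layer sees prior knowledge), and a decoder whose guided-attention layers are each conditioned on a learned weighted mixture of all encoder levels (the cross-memory idea). All class names, the number of memory slots, and the per-layer level weights are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionMemoryLayer(nn.Module):
    """Encoder layer: self-attention whose keys/values are augmented with
    learned memory slots that carry prior knowledge across layers (assumed)."""
    def __init__(self, dim, heads, mem_slots=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, mem_slots, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Prepend memory slots to keys/values so earlier knowledge is reusable.
        mem = self.memory.expand(x.size(0), -1, -1)
        kv = torch.cat([mem, x], dim=1)
        out, _ = self.attn(x, kv, kv)
        return self.norm(x + out)

class GuidedAttentionLayer(nn.Module):
    """Decoder layer: image features attend to (guiding) question features."""
    def __init__(self, dim, heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img, txt):
        h, _ = self.self_attn(img, img, img)
        img = self.norm1(img + h)
        h, _ = self.cross_attn(img, txt, txt)
        return self.norm2(img + h)

class CrossMemoryVQA(nn.Module):
    """Encoder-decoder sketch: every decoder layer is guided by a weighted
    combination of ALL encoder layers' outputs, not just the last one."""
    def __init__(self, dim=512, heads=8, layers=6):
        super().__init__()
        self.enc = nn.ModuleList([SelfAttentionMemoryLayer(dim, heads) for _ in range(layers)])
        self.dec = nn.ModuleList([GuidedAttentionLayer(dim, heads) for _ in range(layers)])
        # One learnable mixing weight per (decoder layer, encoder level) pair.
        self.level_weights = nn.Parameter(torch.zeros(layers, layers))

    def forward(self, txt, img):
        enc_states = []
        for layer in self.enc:
            txt = layer(txt)
            enc_states.append(txt)
        enc_states = torch.stack(enc_states)                 # (layers, B, L, D)
        for i, layer in enumerate(self.dec):
            w = torch.softmax(self.level_weights[i], dim=0)  # mix all levels
            guide = (w[:, None, None, None] * enc_states).sum(dim=0)
            img = layer(img, guide)
        return img

# Usage: e.g. 14 question tokens and 36 region features, both projected to 512-d.
model = CrossMemoryVQA()
fused = model(torch.randn(2, 14, 512), torch.randn(2, 36, 512))
print(fused.shape)  # torch.Size([2, 36, 512])
```

The design choice illustrated here is that low-level and high-level encoder features are fused per decoder layer via a softmax-weighted sum, which is one simple way to let guided attention draw on multi-level text information.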
Source
《计算机科学与应用》
2023, No. 6, pp. 1188-1198 (11 pages)
Computer Science and Application