Abstract
Image question answering based on the fusion of visual and textual features has become one of the active research directions in automatic question answering. Most existing models use attention mechanisms to mine the associations between an image and a question sentence, but they overlook the associations among image regions and question words within the same modality and across different views. To address this problem, the paper proposes an image question answering model based on a multi-view semantic graph network (MSGN), which mines the semantic associations between images and questions from multiple views. MSGN uses a graph neural network to mine fine-grained intra-modal and inter-modal associations between image regions and question words, with the aim of improving the accuracy of answer prediction. Experiments on public image question answering datasets indicate that mining the semantic associations between images and questions from multiple views improves answer prediction performance.
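The intra-modal and inter-modal relations mentioned in the abstract can be illustrated with a small sketch. This is not the authors' MSGN implementation, which the abstract does not specify. It assumes image regions and question words are already embedded in a shared feature space, uses plain dot-product attention with no learned parameters, and all function names (`masked_attention_step`, `multi_view_update`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_attention_step(nodes, mask):
    """One round of attention-weighted message passing.

    nodes: (n, d) array of node features.
    mask:  (n, n) boolean array; mask[i, j] is True when node i may
           receive a message from node j.
    """
    d = nodes.shape[1]
    scores = nodes @ nodes.T / np.sqrt(d)
    # Disallowed edges get a very low score, so their weight becomes 0.
    scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ nodes


def multi_view_update(regions, words):
    """Combine an intra-modal and an inter-modal view over one joint graph.

    regions: (n_r, d) image-region features (assumed precomputed).
    words:   (n_w, d) question-word features (assumed precomputed).
    """
    n_r, n_w = len(regions), len(words)
    nodes = np.concatenate([regions, words], axis=0)
    is_region = np.arange(n_r + n_w) < n_r
    same_modality = is_region[:, None] == is_region[None, :]
    intra = masked_attention_step(nodes, same_modality)   # region-region, word-word
    inter = masked_attention_step(nodes, ~same_modality)  # region-word, word-region
    # Residual sum of the two views; a trained model would use learned fusion.
    return nodes + intra + inter
```

Each view restricts message passing to one edge type, so the intra-modal output for a region depends only on other regions, while the inter-modal output depends only on question words. A trained model would add learned projections and an answer classifier on top of the fused node features.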
Authors
Qiao Youtian (乔有田)
Zhang Haijun (张海军)
Lu Ming (路明)
Affiliations: School of Electronic Engineering, Yangzhou Polytechnic College, Yangzhou, Jiangsu 225200, China; School of Information, Beijing Wuzi University, Beijing 101149, China; School of Cyber Science & Technology, Beihang University, Beijing 100191, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2023, Issue 2, pp. 383-387 (5 pages)
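Funding
Beijing Natural Science Foundation (4182037)
Beijing Social Science Foundation (21XCB005)
Science and Technology Program of the Beijing Municipal Education Commission (KM201810037001)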
Keywords
image question answering
multi-head attention model
automatic question answering
feature fusion
cross-modal analysis