Question-guided spatial relation graph reasoning model for visual question answering

Cited by: 4


Abstract

Objective: Current visual question answering (VQA) models are mostly built on attention mechanisms and multimodal fusion. They do not explicitly model the semantic connections between objects in the image scene and give little weight to the spatial positions of objects, so their spatial relation reasoning is weak. Deep learning has advanced both computer vision and natural language processing (NLP), and interdisciplinary tasks between language and vision, such as VQA, have drawn increasing attention. VQA is regarded as an AI-complete task and serves as a proxy for evaluating progress toward artificial intelligence (AI) that can reason and respond quickly. A VQA model needs to fully understand the visual scene of an image, especially the interactions between multiple objects, so the task inherently requires visual reasoning over the relationships between image objects. For VQA questions that require spatial relation reasoning, this paper models the image structurally using the spatial relation attributes between visual objects and builds a question-guided spatial relation graph reasoning (QG-SRGR) model.

Method: First, saliency-based attention is applied: Faster R-CNN (faster region-based convolutional neural network) extracts the salient visual objects and their visual features from the image. Next, the visual objects and their spatial relationships are structured as a spatial relation graph, in which the visual objects are the vertices and the edges are constructed dynamically from the inherent spatial relations between the objects. Then, question-guided focused attention performs question-based spatial relation reasoning. Focused attention consists of node attention, which finds the visual objects most relevant to the question, and edge attention, which finds the spatial relations most relevant to the question. A gated graph reasoning network (GGRN) is built from the node attention weights and the edge attention weights. Through its message passing mechanism and its control over feature aggregation, the GGRN aggregates the features of neighboring nodes, captures deep interaction information between nodes, and learns spatially aware visual feature representations, thereby achieving question-based spatial relation reasoning. Finally, the spatial relation-aware image features and the question features are fused across modalities to predict the answer.

Result: The QG-SRGR model is trained, validated, and tested on the VQA v2.0 dataset. On the Test-dev set, the overall accuracy is 66.43%, with 83.58% on yes/no questions, 45.61% on counting questions, and 56.62% on other question types. On the Test-std set, the corresponding accuracies are 66.65%, 83.86%, 45.36%, and 56.93%. Compared with the Prior, Language only, MCB (multimodal compact bilinear), ReasonNet, and Bottom-Up models, the proposed model improves clearly on every accuracy measure. Relative to ReasonNet on the Test-std set, it raises accuracy by 2.73% overall, 4.41% on yes/no questions, 5.37% on counting questions, and 0.65% on other questions. Ablation experiments on the validation set verify the effectiveness of the method.

Conclusion: The proposed QG-SRGR model better matches the textual information of the question with the target regions of the image and the spatial relationships between objects. It shows strong reasoning ability, particularly on questions that require spatial relation reasoning.
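The pipeline described in the Method paragraph (spatial relation graph, node attention, edge attention, gated aggregation) can be illustrated with a small sketch. This is not the authors' implementation: the four coarse relation labels, the dot-product scoring, the scalar sigmoid gate, and the names `spatial_relation`, `build_graph`, `reason`, and `rel_emb` are all assumptions chosen for illustration, and hand-written toy vectors stand in for Faster R-CNN features and a learned question encoding.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spatial_relation(box_a, box_b):
    """Coarse position of box_b relative to box_a; boxes are (x1, y1, x2, y2).
    The four labels are an illustrative assumption, not the paper's relation set."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"  # image y grows downward

def build_graph(boxes):
    """Fully connected directed graph: edges[i][j] is the relation of j seen from i."""
    n = len(boxes)
    return {i: {j: spatial_relation(boxes[i], boxes[j]) for j in range(n) if j != i}
            for i in range(n)}

def reason(boxes, feats, question, rel_emb):
    """One round of question-guided message passing (requires at least two objects)."""
    edges = build_graph(boxes)
    # Node attention: relevance of each object to the question.
    node_att = softmax([dot(question, f) for f in feats])
    updated = []
    for i, f_i in enumerate(feats):
        nbrs = sorted(edges[i])
        # Edge attention: relevance of each outgoing spatial relation to the question.
        edge_att = softmax([dot(question, rel_emb[edges[i][j]]) for j in nbrs])
        # Message: neighbor features weighted by both attention scores.
        msg = [0.0] * len(f_i)
        for w, j in zip(edge_att, nbrs):
            for k, x in enumerate(feats[j]):
                msg[k] += w * node_att[j] * x
        # Scalar sigmoid gate (an assumed simplification) limits how much is mixed in.
        gate = 1.0 / (1.0 + math.exp(-dot(question, msg)))
        updated.append([x + gate * m for x, m in zip(f_i, msg)])
    return node_att, updated

# Toy inputs: three objects with 2-d features and a 2-d question vector.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, 0.5]
rel_emb = {"left": [1.0, 0.0], "right": [0.0, 1.0],
           "above": [0.5, 0.5], "below": [-0.5, -0.5]}
node_att, updated = reason(boxes, feats, question, rel_emb)
```

In a real model the attention scores and the gate would come from learned projections, and the updated node features would then be fused with the question features to predict the answer.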

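Authors: Lan Hong; Zhang Pufen (School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China)

Affiliation: School of Information Engineering, Jiangxi University of Science and Technology

Source: Journal of Image and Graphics (《中国图象图形学报》), CSCD, Peking University Core Journal, 2022, No. 7, pp. 2274-2286 (13 pages)

Funding: National Natural Science Foundation of China (61762046); Natural Science Foundation of Jiangxi Province (20161BAB212048).

Keywords: visual question answering (VQA); graph convolution neural network (GCN); attention mechanism; spatial relation reasoning; multimodal learning

Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]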

