Abstract
Objective Chart question answering is an important research task in multi-modal learning for computer vision. The simple pairwise matching scheme of the traditional relation network (RN) model covers the relations between all pixels and therefore achieves good results, but it not only contains redundant information; the quadratically growing number of relation-pair features also places a heavy burden on the subsequent reasoning network in terms of computation and parameter count. To address this problem, we propose a guided-weight-driven re-located relation network model based on fused semantic feature extraction. Method First, richer chart semantic information is extracted by fusing the low-level and high-level image features of the scene task, and an attention-based text encoder is proposed to realize fused semantic feature extraction; the guiding weights are then sorted to further re-locate the image positions, yielding the re-located relation network model. Result Experiments are compared on two datasets. On the FigureQA (an annotated figure dataset for visual reasoning) dataset, compared with IMG+QUES (image + questions), RN, and ARN (appearance and relation networks), the overall accuracy of our method improves by 26.4%, 8.1%, and 0.46%, respectively; on the single validation set, compared with LEAF-Net (locate, encode and attend for figure network) and FigureNet, the accuracy improves by 2.3% and 2.0%. On the DVQA (understanding data visualization via question answering) dataset, without OCR (optical character recognition), the overall accuracy improves by 8.6%, 0.12%, and 2.13% over SANDY (SAN with dynamic encoding model), ARN, and RN, respectively; for the Oracle version, the overall accuracy improves by 23.3%, 7.09%, and 4.8% over SANDY, LEAF-Net, and RN, respectively. Conclusion Centered on the chart question answering task, our method improves accuracy on both of the open-source datasets, DVQA and FigureQA.
Objective Figure-based question answering (Q&A) aims to learn basic information representations from data in real scenes and to provide the basis for reasoning over the text of the accompanying questions. It is widely used in multi-modal learning tasks. Existing methods fall into two broad categories. 1) End-to-end neural network frameworks: a convolutional neural network processes the chart to obtain a feature map of the image, a recurrent neural network encodes the question text into a sentence-level embedding vector, and a fusion inference model produces the answer. To capture an overall representation of the fused multi-modal features, recent work feeds the obtained image feature matrix into the text encoder through attention mechanisms. However, the interaction between relation features in the multi-modal scene severely hinders the extraction of effective semantic features. 2) Multi-module frameworks decompose the task into multiple steps: separate modules first extract feature information, the extracted information is fed to subsequent modules, and the final output is produced by the downstream modules. However, such methods rely on extra annotation information to train the individual modules, and their complexity is considerably higher. We therefore develop a guided-weight-driven re-located relation network model based on fused semantic feature extraction. Method The overall framework of the guided-weight-driven re-located relation network consists of three modules: image feature extraction, an attention-based long short-term memory (LSTM) text encoder, and the joint guided-weight-driven re-located relation network. 1) In the image feature extraction module, image features are extracted by fusing convolutional layers and up-sampling layers. To make the extracted image features better fit the scene task, we design a network that fuses a convolutional neural network with the U-Net architecture so that it can extract the semantics of both low-level and high-level image features. 2) In the attention-based LSTM module, we build the question-based reasoning feature representation with an attention mechanism. A plain LSTM only retains the influence of already-seen words on later words; to obtain a better sentence-level vector representation, the attention mechanism captures different contextual information. 3) In the joint guided-weight-driven re-located relation network module, we propose a paired matching mechanism that guides the matching of relation features in the relation network: the inner product of each pixel's feature vector with the feature vectors of all pixels gives its similarity to all points, and averaging over the whole group yields the guiding weight of that pixel. Although the relation-feature matching sequence obtained in this way reduces the high complexity, it ignores the overall relational balance that the original pairwise matching provides. A re-location operation is therefore carried out to restore this balance: 1) remove the relation feature that pairs a pixel with itself from the obtained relation-feature pair set; 2) swap positions in each pixel's relation-feature list according to a constant-one exchange applied iteratively; and 3) add the position information of the pixels and the sentence-level embedding. In particular, each relation feature is composed of three parts: a) the feature vectors of the two pixels, b) the coordinate values of the two pixels, and c) the embedding representation of the question text (illustrative sketches of the question encoder and the relation-feature construction follow the abstract). Result Experiments on two datasets compare our method with six recent methods. 1) On the FigureQA (an annotated figure dataset for visual reasoning) dataset, compared with IMG+QUES (image + questions), relation networks (RN), and ARN (appearance and relation networks), the overall accuracy increases by 26.4%, 8.1%, and 0.46%, respectively. 2) On the single validation set, compared with LEAF-Net (locate, encode and attend for figure network) and FigureNet, the accuracy increases by 2.3% and 2.0%, respectively. 3) On the DVQA (understanding data visualization via question answering) dataset, without OCR, the overall accuracy increases by 8.6%, 0.12%, and 2.13% compared with SANDY (SAN with dynamic encoding model), ARN, and RN. 4) For the Oracle version, compared with SANDY, LEAF-Net, and RN, the overall accuracy increases by 23.3%, 7.09%, and 4.8%, respectively. Conclusion Our model outperforms the baseline models on the two large open-source chart Q&A datasets.
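The attention-based question encoding described in the Method section can be illustrated with a short sketch. This is not the authors' code: the layer sizes and the single-linear-layer attention scoring are illustrative assumptions. The LSTM yields one hidden state per word, and attention weights pool those states into the sentence-level embedding that later joins the relation features.

```python
# A minimal sketch (assumed, not the paper's implementation) of an
# attention-pooled LSTM question encoder.
import torch
import torch.nn as nn


class AttentionLSTMEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.att = nn.Linear(hidden_dim, 1)  # scores each hidden state

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) integer word indices, 0 = padding
        states, _ = self.lstm(self.embed(token_ids))          # (B, T, H)
        scores = self.att(states).squeeze(-1)                 # (B, T)
        scores = scores.masked_fill(token_ids == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)               # attention over words
        return (weights.unsqueeze(-1) * states).sum(dim=1)    # (B, H) sentence vector
```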
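The guided-weight-driven re-location and pairwise relation-feature construction can likewise be sketched. Again, this is an assumption-laden illustration rather than the paper's implementation: the guiding weight is taken as the mean inner-product similarity of each pixel to all pixels, pixels are re-ordered by sorting that weight (as stated in the abstract), self-pairs are removed, and each pair is concatenated with the pixels' coordinates and the question embedding; the paper's constant-one exchange rule is replaced here by a plain sort.

```python
# A minimal sketch (assumed) of guided-weight-driven relation-feature
# construction for a feature map of shape (B, C, H, W) and a question
# embedding of shape (B, D_q).
import torch


def relational_features(feat_map: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feat_map.shape
    n = h * w
    dev = feat_map.device
    pixels = feat_map.flatten(2).transpose(1, 2)               # (B, N, C)

    # Append normalized (x, y) coordinates to every pixel feature.
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(1, n, 2).expand(b, n, 2)
    pixels = torch.cat([pixels, coords.to(feat_map)], dim=-1)  # (B, N, C+2)

    # Guiding weight: mean inner-product similarity of each pixel to all pixels.
    sim = torch.bmm(pixels, pixels.transpose(1, 2))            # (B, N, N)
    guide = sim.mean(dim=-1)                                   # (B, N)

    # Re-locate: order pixels by guiding weight instead of raster order.
    order = guide.argsort(dim=-1, descending=True)             # (B, N)
    pixels = torch.gather(
        pixels, 1, order.unsqueeze(-1).expand(-1, -1, pixels.size(-1)))

    # Pair every pixel with every *other* pixel (self-pairs removed),
    # and concatenate the question embedding to each pair.
    i_idx, j_idx = torch.meshgrid(
        torch.arange(n, device=dev), torch.arange(n, device=dev), indexing="ij")
    keep = (i_idx != j_idx).reshape(-1)                        # drop (i, i) pairs
    pi = pixels[:, i_idx.reshape(-1)[keep]]                    # (B, N*(N-1), C+2)
    pj = pixels[:, j_idx.reshape(-1)[keep]]
    q = q_emb.unsqueeze(1).expand(-1, pi.size(1), -1)
    return torch.cat([pi, pj, q], dim=-1)                      # relation features
```

In this sketch each output row corresponds to one ordered pixel pair and would be fed to the downstream reasoning network.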
Authors
Li Ying (黎颖), Wu Qingfeng (吴清锋), Liu Jiatong (刘佳桐), Zou Jialong (邹嘉龙)
School of Informatics, Xiamen University, Xiamen 361005, China
Source
Journal of Image and Graphics (中国图象图形学报)
CSCD; Peking University Core Journal (北大核心)
2023, No. 2, pp. 510-521 (12 pages)
Funding
National Key Research and Development Program of China (2017YFC1703303)
Natural Science Foundation of Fujian Province (2020J01435, 2019J01846)
Fujian Provincial External Cooperation Project of China (2019I0001)