Abstract
The interpretability of models has long been a prominent challenge in the field of artificial intelligence. In Visual Question Answering (VQA) systems in particular, there is a critical need to facilitate collaborative reasoning between visual (image) and linguistic (question) components in order to generate answers that are both highly interpretable and reliable. However, existing methods often focus on handling visual and linguistic features separately, failing to capture the intricate high- and low-level interactions required for VQA and providing no explanation of the answer-generation process. To address these issues, this study introduces an innovative approach, Interpretable Transformer-Based Path Visual Question Answering. The method first leverages Transformer encoder layers to separately extract visual and linguistic features from a pre-trained Convolutional Neural Network (CNN) and a domain-specific Language Model (LM). Subsequently, decoder layers are embedded to upsample the encoded features for the final VQA predictions. Extensive experiments conducted on the challenging VQA-X and e-SNLI-VE datasets validate the effectiveness of this approach. Experimental results indicate that the proposed method outperforms other state-of-the-art methods in both qualitative and quantitative evaluations. This research not only helps explain single-image results of VQA models but also provides useful insights into understanding the behavior of VQA models.
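The cross-modal fusion step the abstract describes (question features attending over image features before answer prediction) can be illustrated with a minimal NumPy sketch. This is a hypothetical toy example, not the authors' implementation: the feature dimensions, the random stand-ins for CNN region features and LM token features, and the single-head attention and linear answer head are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product attention: question tokens attend over image regions."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)           # attention over image regions
    return weights @ kv_feats                    # (n_q, d) fused features

# Random stand-ins for pre-trained CNN region features and LM token features
image_feats = rng.standard_normal((36, 64))      # 36 image regions, 64-dim
question_feats = rng.standard_normal((12, 64))   # 12 question tokens, 64-dim

fused = cross_attention(question_feats, image_feats)  # (12, 64)
pooled = fused.mean(axis=0)                           # (64,) pooled representation

W_out = rng.standard_normal((64, 10))            # toy head over 10 candidate answers
logits = pooled @ W_out
answer_idx = int(np.argmax(logits))
```

The attention weights in `cross_attention` are also what makes such a pipeline inspectable: for each question token they indicate which image regions contributed to the answer, which is the kind of per-image explanation the abstract refers to.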
Authors
YUAN Lei; WANG Kejun (School of Information Engineering, Zhengzhou Technology and Business University, Zhengzhou 451400, China; School of Information, Beijing Institute of Technology (Zhuhai Campus), Zhuhai, Guangdong 519088, China; College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China)
Source
《西南大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2024, No. 10, pp. 212-221 (10 pages)
Journal of Southwest University(Natural Science Edition)
Funding
Industry-University Cooperative Education Program of the Ministry of Education (220600440151815).
Keywords
interpretable
vision systems
artificial intelligence
neural networks
transformers