Abstract
The interpretability of models has long been a prominent challenge in the field of artificial intelligence. In Visual Question Answering (VQA) systems in particular, there is a critical need to facilitate collaborative reasoning between visual (image) and linguistic (question) components in order to generate answers that are both highly interpretable and reliable. However, existing methods often focus on handling visual and linguistic features separately, failing to capture the intricate high- and low-level interactions required for VQA and providing no explanation of the answer-generation process. To address these issues, this study introduces an innovative approach, Interpretable Transformer-Based Path Visual Question Answering. The method first leverages Transformer encoder layers to separately extract visual and linguistic features from a pre-trained Convolutional Neural Network (CNN) and a domain-specific Language Model (LM). Subsequently, decoder layers are embedded to upsample the encoded features for the final VQA predictions. Extensive experiments conducted on the challenging VQA-X and e-SNLI-VE datasets validate the effectiveness of this approach. Experimental results indicate that the proposed method outperforms other state-of-the-art methods in both qualitative and quantitative evaluations. This research not only helps explain single-image results of VQA models but also provides useful insights into understanding the behavior of VQA models.
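The cross-modal fusion step the abstract describes (question features attending over image features before answer prediction) can be illustrated with a minimal NumPy sketch. This is a hypothetical toy example, not the authors' implementation: the feature dimensions, the random stand-ins for CNN region features and LM token features, and the single-head attention and linear answer head are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product attention: question tokens attend over image regions."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)           # attention over image regions
    return weights @ kv_feats                    # (n_q, d) fused features

# Random stand-ins for pre-trained CNN region features and LM token features
image_feats = rng.standard_normal((36, 64))      # 36 image regions, 64-dim
question_feats = rng.standard_normal((12, 64))   # 12 question tokens, 64-dim

fused = cross_attention(question_feats, image_feats)  # (12, 64)
pooled = fused.mean(axis=0)                           # (64,) pooled representation

W_out = rng.standard_normal((64, 10))            # toy head over 10 candidate answers
logits = pooled @ W_out
answer_idx = int(np.argmax(logits))
```

The attention weights in `cross_attention` are also what makes such a pipeline inspectable: for each question token they indicate which image regions contributed to the answer, which is the kind of per-image explanation the abstract refers to.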
Authors
YUAN Lei; WANG Kejun (School of Information Engineering, Zhengzhou Technology and Business University, Zhengzhou 451400, China; School of Information, Beijing Institute of Technology (Zhuhai Campus), Zhuhai, Guangdong 519088, China; College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China)
Source
《西南大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2024, No. 10, pp. 212-221 (10 pages)
Journal of Southwest University(Natural Science Edition)
Funding
Industry-University Cooperative Education Program of the Ministry of Education (220600440151815).
Keywords
interpretable
vision systems
artificial intelligence
neural networks
transformers