Abstract
Visual question answering (VQA) is an emerging multi-modal learning task that bridges image content understanding and textual semantic reasoning to produce an answer for a given image and question. Because it involves interactions across modalities, it demands strong capabilities in visual perception and textual semantic learning, and it has attracted wide attention. However, training a VQA model places heavy demands on the dataset: it requires a wide variety of question patterns and a large number of question-answer annotations with different answers for similar scenes to ensure the robustness of the model and its generalization across modalities. Annotating VQA data is labor-intensive and expensive, and this cost has become a bottleneck for the field. To address this problem, this paper proposes a contrastive cross-modal representation learning based active learning method (CCRL) for VQA. The key idea is to cover as many question patterns as possible while making the answer distribution as balanced as possible. CCRL consists of a visual question matching evaluation (VQME) module and a visual answer uncertainty estimation (VAUE) module. The VQME module uses mutual information and contrastive predictive coding as self-supervised constraints to learn the alignment between visual content and question patterns. The VAUE module introduces a label state learning model that adaptively selects matched question patterns for each image and learns the cross-modal semantic association between questions and answers; it then estimates sample uncertainty from the probability distribution over answers, so that CCRL can select the most informative unlabeled samples for annotation. In the experiments, CCRL is compared with the latest active learning algorithms on the VQA-v2 dataset. The results show that CCRL outperforms previous methods under every question pattern and improves accuracy by 1.65% on average over the state-of-the-art active learning method across different sampling rates. With only 30% of the data labeled, CCRL reaches 96% of the performance obtained with 100% labeled data; with 40% labeled, it reaches 97%. This indicates that CCRL selects informative and diverse samples, greatly reducing annotation cost while maximizing VQA performance.
Authors
张北辰
李亮
查正军
黄庆明
ZHANG Bei-Chen; LI Liang; ZHA Zheng-Jun; HUANG Qing-Ming (School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; School of Information Science and Technology, University of Science and Technology of China, Hefei 230027; Peng Cheng Laboratory, Shenzhen, Guangdong 518055)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core Journals
2022, No. 8, pp. 1730-1745 (16 pages)
Chinese Journal of Computers
Funding
Ministry of Science and Technology, Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102000)
National Natural Science Foundation of China (61732007, 61771457, U21B2038)
Youth Innovation Promotion Association of the Chinese Academy of Sciences (20200108)
Fundamental Research Funds for the Central Universities.
Keywords
active learning
cross-modal semantic reasoning
contrastive learning
visual question answering
mutual information