Abstract
Visual question answering (VQA) lies at the intersection of computer vision and natural language processing. In a VQA task, a model first encodes data from two modalities, image and text, then learns the mapping between them, fuses the image features with the text features, and finally produces an answer. The task therefore tests both how well a model understands the image and how well it reasons toward the answer. VQA is an important route to cross-modal human-computer interaction and has broad application prospects. Many new techniques have emerged recently, including scene-reasoning-based methods, contrastive-learning-based methods, and 3D-point-cloud-based methods. VQA models nevertheless commonly suffer from insufficient reasoning ability and a lack of interpretability, which call for further exploration. This paper presents an in-depth survey and summary of related research and novel methods in the field of VQA. It first introduces the background of VQA, then analyzes the current state of research and summarizes the relevant algorithms and datasets, and finally outlines future research directions in light of the problems of current models.
Authors
李祥
范志广
李学相
张卫星
杨聪
曹仰杰
LI Xiang; FAN Zhiguang; LI Xuexiang; ZHANG Weixing; YANG Cong; CAO Yangjie (School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China; Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450000, China)
Source
《计算机科学》
CSCD
Peking University Core Journals
2023, Issue 5, pp. 177-188 (12 pages)
Computer Science
Funding
General Program of the National Natural Science Foundation of China (61972092)
Zhengzhou Collaborative Innovation Major Project (20XTZX06013)
Keywords
Visual question answering
Cross-modal
Human-computer interaction
Reasoning ability
Interpretability