Abstract
Vision-and-language navigation (VLN) is the task in which an agent in an unknown environment starts from an initial location, analyzes a language instruction together with its surrounding visual observations, and dynamically generates a sequence of actions to reach the goal location. Because of its broad application prospects, the task has attracted wide attention in multi-modal research in recent years. Unlike traditional multi-modal tasks such as visual question answering and image captioning, VLN is more challenging in terms of multi-modal fusion and reasoning. However, owing to the limitations of conventional imitation learning and the scarcity of data, models suffer from insufficient generalization. This paper systematically reviews research progress in VLN. It first briefly introduces the datasets and basic models for VLN. It then comprehensively presents representative methods from four aspects: data augmentation, search strategies, training methods, and action spaces. Finally, based on experiments on different datasets, it compares the strengths and weaknesses of existing models and discusses possible directions for future research.
Authors
司马双霖
黄岩
何科技
安东
袁辉
王亮
SIMA Shuang-Lin; HUANG Yan; HE Ke-Ji; AN Dong; YUAN Hui; WANG Liang (Center of Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Shanghai 200031; Artificial Intelligence Research, Chinese Academy of Sciences, Jiaozhou 266300)
Source
Acta Automatica Sinica (《自动化学报》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 1, pp. 1-14 (14 pages)
Keywords
vision-and-language navigation
vision-and-language comprehension
cross-modal matching
embodied artificial intelligence