Survey on Vision-language Pre-training (视觉语言预训练综述)

Abstract: In recent years, deep learning has achieved excellent performance in unimodal fields such as computer vision (CV) and natural language processing (NLP). As the technology has developed, the importance and necessity of multimodal learning have become increasingly apparent. Vision-language learning, an essential part of multimodal learning, has received extensive attention from researchers in China and abroad. Thanks to the development of the Transformer framework, more and more pre-trained models have been applied to vision-language multimodal learning, and the performance of related tasks has improved qualitatively. This study systematically reviews the current work on vision-language pre-trained models. First, it introduces the background knowledge of pre-trained models. Second, it analyzes and compares the structures of pre-trained models from two perspectives, discusses the commonly used vision-language pre-training techniques, and elaborates on five categories of downstream tasks. Finally, it describes the datasets commonly used in image and video pre-training tasks and compares and analyzes the performance of widely used pre-trained models on different datasets under different tasks.
Authors: YIN Jiong, ZHANG Zhe-Dong, GAO Yu-Han, YANG Zhi-Wen, LI Liang, XIAO Mang, SUN Yao-Qi, YAN Cheng-Gang (College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China; Lishui Institute of Hangzhou Dianzi University, Lishui 323000, China; School of Automation, Hangzhou Dianzi University, Hangzhou 210016, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou 310016, China)
Source: Journal of Software (《软件学报》), 2023, No. 5, pp. 2000-2023 (24 pages). Indexed in EI, CSCD, and the Peking University Core Journal list.
Funding: National Key Research and Development Program of China (2020YFB1406604); National Natural Science Foundation of China (61931008, 62071415, U21B2024).
Keywords: multimodal learning; pre-trained model; Transformer; vision-language learning
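
Surveys of this area typically discuss pre-training objectives such as contrastive image-text alignment, popularized by CLIP. As a rough illustration of that technique, the following is a minimal, hypothetical PyTorch sketch; the ToyVisionLanguageModel class, the linear encoder stubs, and the feature dimensions are illustrative assumptions rather than the architecture of any model reviewed in the paper.

```python
# Minimal sketch of a CLIP-style contrastive image-text pre-training
# objective. Hypothetical example for illustration only; the encoders
# here are linear stubs standing in for real vision/text backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVisionLanguageModel(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Stand-ins for full encoders (e.g., a ViT and a BERT-style
        # Transformer) that project pooled features into a joint space.
        self.image_encoder = nn.Linear(2048, embed_dim)
        self.text_encoder = nn.Linear(768, embed_dim)
        # Learnable temperature, stored in log space as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_encoder(image_feats), dim=-1)
        txt = F.normalize(self.text_encoder(text_feats), dim=-1)
        # Pairwise cosine similarities scaled by temperature.
        return self.logit_scale.exp() * img @ txt.t()


def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    # Matched image-text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    model = ToyVisionLanguageModel()
    images = torch.randn(8, 2048)  # batch of pooled image features
    texts = torch.randn(8, 768)    # batch of pooled caption features
    loss = contrastive_loss(model(images, texts))
    print(f"contrastive loss: {loss.item():.4f}")
```

In actual pre-training, the linear stubs would be replaced by full image and text encoders, and the loss would be computed over large batches of paired image-caption data so that each caption serves as a negative example for every non-matching image in the batch.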