
Visual Prompt Learning: A Survey (视觉提示学习综述)
Abstract  With the rapid development of deep learning models and the continuous growth of their parameter sizes, fine-tuning the entire model for each downstream application with a different objective is prohibitive. To address this issue, prompt learning was first proposed in the field of natural language processing (NLP) and has been widely studied in recent years. By reformulating various downstream tasks into the same form as the pre-training task, prompt learning successfully leverages large-scale pre-trained language models in various downstream applications with great efficiency from both the parameter and data perspectives. Among them, models pre-trained by masked language modeling (MLM), represented by BERT, have achieved great success via "cloze prompts" in tasks requiring word-level output, such as text classification and named entity recognition; models pre-trained via autoregressive/causal language modeling (A/CLM), such as GPT, have been widely applied via "prefix prompts" in tasks requiring text-level output, including dialogue generation, question answering, summarization, etc. Witnessing the success of prompt learning in NLP, language models have also been applied to multimodal vision-language understanding problems through prompt learning. However, these models and methods still cannot solve dense prediction tasks in vision. In addition, the expensive and complex process of fine-tuning an entire pre-trained model also arises in vision-related areas. Inspired by the great success of prompt learning in NLP, researchers have gradually applied it to various vision-related tasks, including image classification, object detection, image segmentation, domain adaptation, continual learning, etc. Given the lack of a comprehensive survey of prompt learning in the vision area, this paper conducts a comprehensive introduction to and analysis of prompt learning methods in the unimodal vision area and the multimodal vision-language area. First, as preliminaries, we briefly introduce the pre-trained models in NLP, the basic concepts of prompt learning, the forms of downstream applications, and the types of prompt templates. Second, we present the pre-trained models and tasks adopted by unimodal vision and multimodal vision-language prompt learning methods, respectively. Then, we give a comprehensive introduction to the prompt learning methods in vision-related areas. It is worth noting that prompt learning methods in NLP are designed to inherit the pre-training task and thereby unify all downstream applications; in contrast, current prompt learning methods in the unimodal vision and multimodal vision-language fields are designed for specific downstream applications. Therefore, we first give a brief categorization by method design, and then detail unimodal visual prompt learning and multimodal vision-language prompt learning methods from the perspective of application tasks. On the one hand, unimodal visual prompt learning methods are mainly designed by concatenating learnable prompt tokens, adding optimizable pixel-wise perturbations, learning prompt networks, combining multiple prompt modules, constructing label mappings, neural architecture search, etc. On the other hand, popular designs of multimodal vision-language prompt learning methods include textual prompt learning, vision-guided textual prompt learning, text- or knowledge-guided textual prompt learning, vision-language joint prompt learning, distribution-based prompt learning, multitask-shared prompt learning, gradient-guided prompt learning, etc. Finally, we make an in-depth analysis and comparison between prompt learning methods in NLP and those in vision-related fields, and offer an outlook on future research directions.
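To make one of the method designs listed above concrete, the following is a minimal sketch (not taken from any specific surveyed method) of the "optimizable pixel-wise perturbation" style of unimodal visual prompt learning: a learnable additive prompt is applied to every input image while a pre-trained backbone stays frozen, and only the prompt parameters are updated. The full-resolution additive prompt, the torchvision ResNet-50 backbone, the hyperparameters, and the direct reuse of the ImageNet label space (standing in for the label-mapping design mentioned in the abstract) are all illustrative assumptions.

import torch
import torch.nn as nn
import torchvision

class VisualPrompt(nn.Module):
    """Learnable pixel-wise perturbation added to each input image."""
    def __init__(self, image_size=224):
        super().__init__()
        # Assumption: a full-resolution additive prompt, initialized to zero.
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        # The single prompt is broadcast over the whole batch.
        return x + self.delta

# Frozen pre-trained backbone (weights argument follows torchvision >= 0.13).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

# Only the prompt parameters are optimized.
prompt = VisualPrompt()
optimizer = torch.optim.Adam(prompt.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on random tensors standing in for a
# downstream batch; labels are assumed to be mapped into the 1000-class
# pre-training label space.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

optimizer.zero_grad()
logits = backbone(prompt(images))
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

The same training loop applies when the prompt is instead a set of learnable tokens concatenated to the patch sequence of a frozen vision transformer; only the prompt module changes.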
Authors: LIAO Ning (廖宁), CAO Min (曹敏), YAN Jun-Chi (严骏驰) (Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240; School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215021)
Source: Chinese Journal of Computers (计算机学报), indexed by EI, CAS, CSCD, and the Peking University Core Journal List, 2024, No. 4, pp. 790-820 (31 pages)
Funding: Supported by the Excellent Young Scientists Fund of the National Natural Science Foundation of China (No. 62222607), the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0102), and the National Natural Science Foundation of China (No. 62002252).
Keywords: large-scale pre-trained model; natural language processing; unimodal visual prompt learning; multimodal vision-language prompt learning