分割一切模型SAM的潜力与展望:综述

Potential and prospects of segment anything model: a survey
Abstract  The emergence of foundational large-scale models such as contrastive language-image pre-training (CLIP), chat generative pre-trained Transformer (ChatGPT), and generative pre-trained Transformer-4 (GPT-4) has driven rapid progress in artificial general intelligence (AGI). AGI aims to endow artificial intelligence systems with stronger capabilities, enabling them to learn autonomously, evolve continuously, and solve diverse problems and tasks across many domains. After training on massive datasets, these foundational models can successfully handle a multitude of downstream tasks. Against this background, Meta's segment anything model (SAM) achieved a major breakthrough in 2023, delivering such strong performance in image segmentation that it has even been called the "terminator" of the field. One reason is the segment anything 1 billion (SA-1B) dataset, the largest image segmentation dataset to date, which contains 11 million images and more than one billion masks. SA-1B was collected in three stages through SAM's data engine, an approach that simultaneously ensures the quality and diversity of the masks and contributes significantly to breakthroughs in the segmentation domain. This development has profoundly influenced the advancement of foundational models in computer vision, and shortly after SAM was open-sourced, researchers proposed a series of improved methods and applications. To provide a comprehensive understanding of SAM's development, strengths, and weaknesses, this study reviews and analyzes the relevant research.

First, the study introduces the background and core framework of SAM from three aspects. The first aspect covers the tasks of SAM, including traditional image segmentation and prompt-guided interactive image segmentation. The second covers the model architecture, comprising an image encoder, a prompt encoder, and a mask decoder. The third covers the data: the data engine used to collect the dataset and the resulting SA-1B dataset.

Building on this foundation, the study organizes and analyzes methods for improving SAM from two perspectives. The first is improving inference speed, because faster inference reduces the deployment cost of SAM and makes it easier to run on less powerful devices. The second is improving prediction accuracy: SAM itself lacks specific semantic information, which leads to suboptimal segmentation results in complex scenarios, so considerable research focuses on enhancing its prediction accuracy.
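To make the prompt-guided pipeline described above concrete, the following is a minimal sketch of interactive segmentation with Meta's official segment_anything package. The checkpoint filename, the stand-in image, and the point coordinates are illustrative assumptions, not values taken from the survey.

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Load a pretrained SAM (ViT-H variant); the checkpoint path is an assumption.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # Stand-in for a real HxWx3 uint8 RGB image.
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    predictor.set_image(image)  # the heavy image encoder runs once per image

    # One foreground click; the lightweight prompt encoder and mask decoder
    # then turn it into masks almost instantly, enabling interactive use.
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinate, assumed
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,                # return three candidates for ambiguous prompts
    )

Because the image embedding is computed once and reused across prompts, most of the latency sits in the image encoder, which is why much of the speed-oriented follow-up work targets that component.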
Subsequently, the study thoroughly reviews the current applications of SAM across various tasks and data types, divided into three parts. The first part covers image processing-related tasks, including style transfer, object detection, object counting, image editing, complex image segmentation, and medical image segmentation; notably, applying SAM directly to medical image segmentation may not yield satisfactory results, which suggests that further adjustments are needed for specific scenario tasks. The second part covers video-related tasks, including video super-resolution, video object tracking, and audio-visual scene segmentation. The third part explores other directions, such as point cloud segmentation, 3D reconstruction, controllable image caption generation, and data annotation. By organizing the applications of SAM in these three parts, the study summarizes the advantages and limitations of applying SAM to various downstream tasks; these analyses can help researchers better apply and improve SAM, enhancing its robustness and generalization capabilities.
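A recurring pattern in the detection-related applications above is to feed boxes predicted by an off-the-shelf detector into SAM as prompts, upgrading detections into pixel-accurate instance masks. The sketch below, which reuses the predictor from the previous snippet, shows the general idea; the box coordinates are assumed values, and this pairing is one common usage pattern rather than a method prescribed by the survey.

    import numpy as np

    # Boxes from any detector, as (x0, y0, x1, y1) in pixel coordinates (assumed).
    detector_boxes = np.array([[80, 60, 420, 360], [450, 200, 620, 380]])

    instance_masks = []
    for box in detector_boxes:
        masks, scores, _ = predictor.predict(
            box=box,                  # a single box prompt per call
            multimask_output=False,   # one confident mask per box is usually enough
        )
        instance_masks.append(masks[0])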
Finally, the study proposes several valuable future research directions for SAM:

1) Modularization. Although SAM already performs well on certain tasks, its efficiency and flexibility still need to improve. As its application domains keep expanding, many applications require SAM to acquire new knowledge, so the model needs domain adaptation and continual learning capabilities. Drawing inspiration from large language models, new modular structures can be added to SAM to strengthen these capabilities.

2) Weakly supervised semantic segmentation. This setting typically requires retraining a classification model and generating pseudo-labels, which involves time-consuming and intricate steps. Recent studies use SAM as a base model in this domain, capitalizing on its strong generalization to obtain satisfactory results without fine-tuning. However, although SAM produces relatively clear results in many explicit scenarios, it has difficulty generating accurate segmentation masks in semantically ambiguous scenarios because the model contains no semantic information. To address this complexity, more diverse weak labels and additional post-processing modules could be used to improve SAM's segmentation accuracy; exploring SAM as a foundational model in weakly supervised semantic segmentation could yield promising results.

3) Multimodal fusion for image segmentation. At present, SAM's prompt input mainly takes four forms: points, target boxes, segmentation masks, and text prompts. However, the continuous expansion of SAM's application areas has introduced new requirements for prompt forms. SAM currently focuses on 2D visual tasks, with potential consideration for future applications in 3D visual tasks. These extensions include considering different input modalities for SAM prompts and introducing time-series prompts to address SAM's limitations in video processing, further improving its performance on various video downstream tasks.

4) Efficient fine-tuning of SAM. Although SAM has been widely used in many domains, its performance in certain specific application scenarios still falls short of state-of-the-art domain models. Studies have shown that fine-tuning SAM on domain-specific datasets improves its performance; however, the fine-tuning process is costly due to the large size of the model, so performing it efficiently becomes an important issue. Given SAM's substantial parameter count, adding new modules to the model, freezing its core during training, and training only the newly added modules significantly reduce the training cost (a minimal sketch follows the abstract). This approach facilitates further research on applying SAM to various downstream tasks.

5) Leveraging Gestalt psychology's holistic cognitive perspective to enhance SAM's adversarial robustness. The vulnerability of SAM to attacks may stem from overfitting to local cognition; introducing holistic cognition could prevent such overfitting and resist attacks involving noise.

By consolidating and summarizing SAM in this study, SAM can be further developed and applied to drive the advancement of foundational models in the field of computer vision.
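As referenced in direction 4), the parameter-efficient recipe of freezing SAM's core and training only newly added or lightweight parts can be sketched in PyTorch as follows. Treating the mask decoder as the only trainable component is one simple instance of this idea, chosen here for illustration rather than prescribed by the survey; the learning rate is an assumption.

    import torch

    # `sam` is a loaded segment_anything model, as in the first snippet.
    # Freeze the heavy image encoder and the prompt encoder: the vast
    # majority of SAM's parameters never receive gradients.
    for p in sam.image_encoder.parameters():
        p.requires_grad = False
    for p in sam.prompt_encoder.parameters():
        p.requires_grad = False

    # Train only the small mask decoder (or, in adapter-style variants,
    # only the newly inserted modules).
    trainable = [p for p in sam.mask_decoder.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # assumed learning rate

Variants of this recipe instead insert small adapter layers into the frozen encoder; either way, the training cost drops to a fraction of full fine-tuning.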
Authors  王淼 (Wang Miao), 黄智忠 (Huang Zhizhong), 何晖光 (He Huiguang), 卢湖川 (Lu Huchuan), 单洪明 (Shan Hongming), 张军平 (Zhang Junping) (School of Computer Science, Fudan University, Shanghai 200437, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China)
Source  Journal of Image and Graphics (《中国图象图形学报》), 2024, No. 6, pp. 1479-1509 (31 pages); indexed in CSCD and the Peking University Core Journal list.
Funding  National Natural Science Foundation of China (62176059).
Keywords  artificial general intelligence (AGI); computer vision; image segmentation; visual foundational models; segment anything model (SAM); large language model (LLM)