Zero-shot referring image segmentation based on multimodal feature frequency domain fusion
Abstract: Semantic segmentation cannot handle undefined categories when it is applied to real-world downstream tasks, and the referring image segmentation task was proposed to address this problem. The task locates the target in an image that matches a natural-language description. Most existing methods use a cross-modal decoder to fuse features extracted independently by a visual encoder and a language encoder, but this approach cannot make effective use of the edge features of the image and is complicated to train. CLIP (contrastive language-image pre-training) is a powerful pre-trained vision-language cross-modal model that extracts image and text features effectively, so this paper proposes a method that fuses CLIP-encoded multimodal features in the frequency domain. First, an unsupervised model performs coarse-grained segmentation of the image, and the nouns in the natural-language text are extracted for the subsequent steps. Next, the image encoder and text encoder of CLIP encode the image and the text, respectively. A wavelet transform then decomposes the image and text features, so that the edge features of the image and the position information within the image can be fully exploited; the image features and text features are fused in the frequency domain, and the fused features are inverse-transformed. Finally, the text features are matched against the image features pixel by pixel to obtain the segmentation result. The method was tested on commonly used datasets. The experimental results show that the network achieves good results in the training-free zero-shot setting and has good robustness and generalization ability.
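The frequency-domain steps described in the abstract (wavelet decomposition, fusion, inverse transform, pixel-by-pixel matching) can be illustrated with a minimal NumPy sketch. The abstract does not specify the wavelet, the fusion rule, or the matching threshold, so everything concrete here is an assumption for illustration only: a one-level Haar wavelet, a hypothetical additive fusion of the text embedding into the low-frequency subband (with weight `alpha`), a plain text vector that is not itself wavelet-decomposed, and cosine similarity with a fixed `threshold`. The function names are also hypothetical, and CLIP encoding, coarse segmentation, and noun extraction are assumed to have happened upstream.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D Haar DWT over the first two axes of an (H, W, C) array.

    H and W must be even. Returns the low-frequency subband and three
    high-frequency (edge-like) subbands, each of shape (H/2, W/2, C).
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a + b - c - d) / 2.0  # difference between rows
    hl = (a - b + c - d) / 2.0  # difference between columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (exact reconstruction)."""
    h, w = ll.shape[:2]
    x = np.empty((2 * h, 2 * w) + ll.shape[2:], dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def fuse_in_frequency_domain(image_feats, text_feat, alpha=0.5):
    """Fuse a text embedding into image features in the wavelet domain.

    ASSUMPTION: the additive rule below is a placeholder, not the
    fusion rule used in the paper. High-frequency subbands are left
    untouched so that edge information survives the round trip.
    """
    ll, lh, hl, hh = haar_dwt2(image_feats)
    fused_ll = ll + alpha * text_feat  # broadcasts (C,) over (H/2, W/2, C)
    return haar_idwt2(fused_ll, lh, hl, hh)


def pixelwise_match(feats, text_feat, threshold=0.5):
    """Cosine similarity between each pixel feature and the text feature.

    ASSUMPTION: a fixed threshold turns similarities into a binary mask.
    Returns (mask, similarity_map), both of shape (H, W).
    """
    eps = 1e-8
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    sim = f @ t
    return sim > threshold, sim
```

In use, `image_feats` would be a dense `(H, W, C)` feature map and `text_feat` a `(C,)` embedding in the same space; `fuse_in_frequency_domain` followed by `pixelwise_match` yields a mask of shape `(H, W)`. Leaving the high-frequency subbands unchanged is one simple way to preserve edge detail, which is the motivation the abstract gives for working in the frequency domain.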
Authors: Lin Haoran (林浩然); Liu Chunqian (刘春黔); Xue Rongrong (薛榕融); Xie Xunwei (谢勋伟); Lei Yinjie (雷印杰) (School of Electronic Information, Sichuan University, Chengdu 610065, China; Key Laboratory of Optical Engineering, Institute of Optics & Electronics, Chinese Academy of Sciences, Chengdu 610209, China; CETC Key Laboratory of Avionic Information System Technology, The 10th Research Institute of China Electronics Technology Group Corporation, Chengdu 610036, China)
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2024, No. 5, pp. 1562-1568 (7 pages)
Funding: National Natural Science Foundation of China (62276176).
Keywords: referring image segmentation; CLIP; wavelet transform; zero-shot
