Zero-shot 3D Shape Classification Based on Semantic-enhanced Language-Image Pre-training Model
Abstract: Contrastive Language-Image Pre-training (CLIP) has shown great potential for zero-shot 3D shape classification. However, the large modality gap between 3D shapes and text limits further gains in classification accuracy. To address this problem, this paper proposes a zero-shot 3D shape classification method based on semantic-enhanced CLIP. First, each 3D shape is represented as multiple views. Then, to improve the recognition of unseen categories in zero-shot learning, a visual-language generative model produces a semantic descriptive text for each view and its category; this text serves as a semantic bridge between the views and the category prompt texts. The semantic descriptive texts are obtained in two ways: image captioning and visual question answering. Finally, a fine-tuned semantic encoder concretizes the semantic descriptive texts into semantic descriptions of each category. These descriptions carry rich semantic information, are readily interpretable, and effectively narrow the semantic gap between views and category prompt texts. Experiments show that the proposed method outperforms existing zero-shot classification methods on the ModelNet10 and ModelNet40 datasets.
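The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `classify_shape`, the blending weight `alpha`, the averaging over views, and the toy 3-dimensional embeddings are all assumptions made for illustration. In practice the view, caption, and category-prompt embeddings would come from CLIP-style encoders, and the abstract does not specify how the paper actually fuses them.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify_shape(view_embs, caption_embs, class_embs, alpha=0.5):
    """Score each category for one 3D shape rendered as several views.

    view_embs:    one image embedding per rendered view
    caption_embs: one text embedding per view's descriptive text
    class_embs:   one text embedding per category prompt
    alpha:        hypothetical weight between the image and caption paths
    """
    scores = []
    for c in class_embs:
        # Blend view-to-prompt and caption-to-prompt similarity per view.
        per_view = [
            alpha * cosine(v, c) + (1 - alpha) * cosine(t, c)
            for v, t in zip(view_embs, caption_embs)
        ]
        # Average over views to get one score per category.
        scores.append(sum(per_view) / len(per_view))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (made up): two categories, two views of one shape.
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_embs = [[0.6, 0.5, 0.1], [0.5, 0.6, 0.2]]      # visually ambiguous
caption_embs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]   # captions favor category 0
best, scores = classify_shape(view_embs, caption_embs, class_embs)
```

In this toy setting the view embeddings alone barely separate the two categories, while the caption embeddings pull the score toward the first one, which is the "semantic bridge" role the abstract assigns to the descriptive texts.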
Authors: DING Bo; ZHANG Libao; QIN Jian; HE Yongjun (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China)
Source: Journal of Electronics & Information Technology (《电子与信息学报》), 2024, No. 8, pp. 3314-3323 (10 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
Funding: National Natural Science Foundation of China (61673142); Natural Science Foundation of Heilongjiang Province (LH2022F029, JQ2019F002).
Keywords: 3D shape classification; Zero-shot; Contrastive Language-Image Pre-training (CLIP); Semantic descriptive text
