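Abstract: Pre-trained models (PTMs) can effectively acquire rich knowledge from unlabeled data by using sophisticated pre-training objectives and large numbers of model parameters. In the multimodal domain, however, PTM development is still at an early stage. Depending on the modalities involved, most current multimodal PTMs can be divided into image-text PTMs and video-text PTMs; depending on how the data are fused, multimodal PTMs can also be divided into single-stream and dual-stream models. First, common pre-training tasks and the downstream tasks used in validation experiments are summarized. Next, common models in the multimodal pre-training field are reviewed, with tables listing each model's downstream tasks and comparing model performance and experimental data. Then, the application scenarios of the M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) model, the Cross-modal Prompt Tuning (CPT) model, the VideoBERT (Video Bidirectional Encoder Representations from Transformers) model, and the AliceMind (Alibaba's collection of encoder-decoders from Mind) model in specific downstream tasks are introduced. Finally, the challenges facing work on multimodal PTMs and possible future research directions are summarized.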
Funding: National Natural Science Foundation of China (No. 61971121).
Abstract: Clothing attribute recognition has become an essential technology that enables users to automatically identify the characteristics of clothes and search for clothing images with similar attributes. However, existing methods cannot recognize newly added attributes and may fail to capture region-level visual features. To address these issues, a region-aware fashion contrastive language-image pre-training (RaF-CLIP) model was proposed. This model aligns cropped and segmented images with category texts and multiple fine-grained attribute texts, matching fashion regions to their corresponding texts through contrastive learning. Clothing retrieval finds suitable clothing based on user-specified categories and attributes. To further improve retrieval accuracy, an attribute-guided composed network (AGCN) was introduced as an additional component on RaF-CLIP, specifically designed for composed image retrieval, a task that aims to modify the reference image based on textual expressions in order to retrieve the expected target. By adopting transformer-based bidirectional attention and a gating mechanism, AGCN fuses and selects image features and attribute text features. Experimental results show that the proposed model achieves a mean precision of 0.6633 on attribute recognition tasks and a recall@10 of 39.18 on the composed image retrieval task (recall@k is defined as the percentage of correct samples appearing in the top k retrieval results), satisfying user needs for freely searching for clothing through images and texts.
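The recall@k metric defined in the abstract can be illustrated with a short sketch. It assumes the common retrieval reading in which a query counts as correct when at least one of its target items appears among its top k ranked results; the abstract does not spell out this detail. The function name `recall_at_k`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked_results: Sequence[Sequence[str]],
    targets: Sequence[Set[str]],
    k: int,
) -> float:
    """Return the percentage of queries with a correct item in the top k.

    ranked_results[i] is the ranked list of retrieved item ids for query i
    (best match first); targets[i] is the set of correct item ids for it.
    Assumption: one hit in the top k is enough for the query to count.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(ranked_results) != len(targets):
        raise ValueError("ranked_results and targets must have equal length")
    if not ranked_results:
        return 0.0

    hits = 0
    for ranked, correct in zip(ranked_results, targets):
        # Only the first k retrieved items are considered.
        if any(item in correct for item in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_results)


if __name__ == "__main__":
    # Toy example with four queries; the ids are made up for illustration.
    ranked = [
        ["a", "b", "c"],
        ["d", "e", "f"],
        ["g", "h", "i"],
        ["x", "y", "z"],
    ]
    correct = [{"a"}, {"f"}, {"z"}, {"y"}]
    for k in (1, 2, 3):
        print(f"recall@{k} = {recall_at_k(ranked, correct, k):.2f}")
```

Under this reading, a reported recall@10 of 39.18 would mean that the expected target appears in the top 10 results for about 39% of the queries.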