Abstract
To address the problems of extracting effective fine-grained features from images and text in cross-modal pedestrian retrieval, and of aligning images with natural-language descriptions across modalities, a cross-modal pedestrian retrieval model based on multi-scale feature enhancement and alignment is proposed. The model introduces a multimodal pre-trained model and constructs a text-guided masked image modeling auxiliary task to fully realize cross-modal interaction, enhancing its ability to learn local image details without explicit annotation. In addition, to counter the identity confusion common among pedestrian images, a global image feature matching auxiliary task is designed to guide the model toward identity-relevant visual features. Experimental results on the public CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets show that the proposed model surpasses existing mainstream models, achieving Rank-1 accuracies of 72.47%, 62.71%, and 59.25%, respectively, and thus high-accuracy cross-modal pedestrian retrieval.
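As a concrete illustration of the two auxiliary tasks named in the abstract, below is a minimal PyTorch sketch of a text-guided masked image modeling head and an identity-aware global matching loss. The paper's actual architecture, feature dimensions, mask ratio, and loss formulation are not given here, so every module name, hyperparameter, and design choice in the sketch is an assumption for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMIMHead(nn.Module):
    """Reconstructs masked image patch embeddings by cross-attending to text
    tokens, so the vision branch learns local details guided by the caption.
    dim=512 mirrors CLIP's projection width; this is an assumption."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, patch_emb, text_emb, mask):
        # patch_emb: (B, N, D) patch embeddings from the CLIP image encoder
        # text_emb:  (B, L, D) token embeddings from the CLIP text encoder
        # mask:      (B, N) bool, True where a patch is masked out
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(patch_emb), patch_emb)
        # Text tokens serve as keys/values: the caption guides reconstruction.
        attn_out, _ = self.cross_attn(query=x, key=text_emb, value=text_emb)
        pred = self.decoder(self.norm(x + attn_out))
        # Reconstruction loss is computed only on the masked positions.
        return F.mse_loss(pred[mask], patch_emb[mask].detach())

def global_matching_loss(img_feat, txt_feat, person_ids, temperature=0.07):
    """Identity-aware global image-text matching (assumed contrastive form):
    pairs sharing a person ID are treated as positives, pushing the encoders
    toward identity-relevant rather than appearance-only features."""
    img_feat = F.normalize(img_feat, dim=-1)        # (B, D)
    txt_feat = F.normalize(txt_feat, dim=-1)        # (B, D)
    logits = img_feat @ txt_feat.t() / temperature  # (B, B) similarity matrix
    pos = (person_ids.unsqueeze(0) == person_ids.unsqueeze(1)).float()
    pos = pos / pos.sum(dim=1, keepdim=True)        # soft targets over positives
    loss_i2t = -(F.log_softmax(logits, dim=1) * pos).sum(1).mean()
    loss_t2i = -(F.log_softmax(logits.t(), dim=1) * pos).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In training, both terms would typically be added to the main retrieval objective as weighted auxiliary losses; the weights are not specified in the abstract.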
Authors
XU Ling; MIAO Yi; ZHANG Weifeng (School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China; School of Information Science and Engineering, Jiaxing University, Jiaxing 314001, China)
Source
《现代电子技术》 (Modern Electronics Technique), Peking University Core Journal (北大核心), 2024, No. 22, pp. 44-50 (7 pages)
Keywords
cross-modal pedestrian retrieval
multi-scale feature enhancement
multimodal alignment
CLIP
image mask
cross-modal interaction
cross-attention