Abstract
Person re-identification is an important research direction in computer vision, aiming to identify and track the same person across different surveillance cameras. Because video frames exhibit a variety of temporal relationships, from which motion patterns and fine-grained features of the target can be extracted, video-based re-identification offers richer spatio-temporal cues than image-based re-identification and is closer to real-world applications. The key problem is how to mine these spatio-temporal cues as features for video re-identification. This paper proposes a Transformer-based long and short term temporal relationship network, the Long and Short Time Transformer (LSTT), for video-based person re-identification. The network contains long-term and short-term temporal relationship modules that extract important temporal information and strengthen the feature representation. The long-term temporal relationship module stores per-frame information in memory cues and builds global connections at every frame; the short-term temporal relationship module models the interaction between adjacent frames to learn fine-grained target information and improve the feature representation. In addition, to improve the model's adaptability to different target features, a multi-scale module containing convolution kernels of different sizes is designed. With multiple convolutional receptive fields, this module covers the target region more comprehensively and further improves the generalization of the model. Experimental results on the MARS, MARS_DL, and iLIDS-VID datasets show that the LSTT model achieves the best performance.
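The record does not include the authors' code. As a rough illustration of two ideas named in the abstract, the sketch below shows, in PyTorch, one plausible form of a multi-scale convolution block (parallel kernels with different receptive fields) and a short-term adjacent-frame interaction. All names (MultiScaleConv, ShortTermRelation, kernel_sizes) and design details are assumptions made for illustration, not the LSTT implementation described in the paper.

```python
# Hypothetical sketch of the multi-scale and short-term temporal ideas
# mentioned in the abstract; NOT the authors' implementation.
import torch
import torch.nn as nn


class MultiScaleConv(nn.Module):
    """Parallel convolutions with different kernel sizes (receptive fields),
    fused by summation, as a rough analogue of the multi-scale module."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):              # x: (B*T, C, H, W) frame feature maps
        return sum(branch(x) for branch in self.branches)


class ShortTermRelation(nn.Module):
    """Attends from each frame to its next neighbour, a rough analogue of
    modelling interactions between adjacent frames."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):          # feats: (B, T, D) per-frame features
        neighbours = torch.roll(feats, shifts=-1, dims=1)   # next frame
        out, _ = self.attn(query=feats, key=neighbours, value=neighbours)
        return feats + out             # residual fusion


if __name__ == "__main__":
    frames = torch.randn(2 * 8, 256, 16, 8)      # 2 tracklets of 8 frames
    print(MultiScaleConv(256)(frames).shape)     # torch.Size([16, 256, 16, 8])

    seq = torch.randn(2, 8, 512)                 # pooled per-frame features
    print(ShortTermRelation(512)(seq).shape)     # torch.Size([2, 8, 512])
```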
Authors
何智敏
钱江波
严迪群
叶绪伦
王翀
HE Zhi-min; QIAN Jiang-bo; YAN Di-qun; YE Xu-lun; WANG Chong (Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211, China; Zhejiang Key Laboratory of Mobile Network Application Technology, Ningbo, Zhejiang 315211, China)
Source
《电子学报》
EI
CAS
CSCD
Peking University Core Journal
2024, No. 8, pp. 2746-2757 (12 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 62271274)
Ningbo Science and Technology Project (No. 2024Z004, No. 2023Z059).
Keywords
video-based person re-identification
Transformer
long-term temporal relationship
short-term temporal relationship
multi-scale module