Abstract
Visual object tracking is an important yet challenging task in computer vision with a wide range of applications, such as video surveillance, robotics, action recognition, scene understanding, intelligent transportation, visual navigation, and human-machine interaction. It aims to estimate the state of an arbitrary object in video frames, given the object bounding box in an initial frame. In recent years, deep learning technology has driven rapid progress in the object tracking field, and numerous deep-learning-based visual tracking methods have achieved strong results, especially Siamese trackers, which aim to learn a decision-based similarity evaluation. Nevertheless, insufficient labeled data limits the efficient training of deep network models. Therefore, self-supervised learning has been applied to object tracking to solve the problem that model training requires a large amount of labeled data. However, existing self-supervised trackers mostly extract shallow information about the object and lack an efficient representation of its key features. In addition, they ignore the difficulty of reverse verification caused by challenges such as object occlusion, resulting in a decrease in tracking accuracy. To solve these problems, a multi-frame consistency correction based self-supervised Siamese network tracking method (MCCSST) is proposed in this paper, which consists of a forward multi-frame reverse-order verification strategy, a mixed-order correction module, and a visual feature enhancement module. First, the forward multi-frame reverse-order verification strategy adaptively selects the optimal tracking trajectory from multiple paths to construct a cycle-consistency loss, so that challenges such as object occlusion, background clutter, and deformation can be reasonably circumvented. Second, to address the inconsistent object localization produced by multiple paths in the same frame, a mixed-order correction module is proposed to correct tracking drift and enhance the robustness of object feature extraction; it exploits the temporal information of a video to better focus on the object's own features during forward tracking. In addition, the visual feature enhancement module, consisting of a channel correlation branch, a convolution block branch, and a spatial correlation branch, enhances the representation ability of object features by adaptively weighted fusion of the object's global context information and local semantic feature information. To strengthen the channel category and spatial position information of the object while suppressing irrelevant background information, we further develop an adaptive feature fusion scheme that fuses the multi-dimensional feature maps of the three branches. Based on the Siamese network architecture, a Discriminant Correlation Filters Network with Vital Feature Enhancement (DCFNet-VFE) is designed as our baseline, and the object location is then obtained through the filter layer. Finally, the proposed method is evaluated on four public object tracking benchmark datasets: OTB2013, OTB2015, TColor-128, and VOT-2018. The experimental results show that, in complex scenes (e.g., illumination variation, deformation, and background interference), the accuracy of the proposed method on the four benchmarks is improved by 4.6% on average over twenty-one compared state-of-the-art trackers, and is 5.8% higher on average than that of self-/unsupervised learning-based trackers.
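As a toy illustration (not the paper's implementation) of how a cycle-consistency loss with multiple candidate paths can be constructed, the sketch below tracks a synthetic 1-D "object" forward along several frame paths, re-tracks it in reverse order back to the first frame, and scores each path by how far the back-tracked position drifts from the starting position. The toy correlation tracker and all names here are hypothetical stand-ins for the Siamese similarity search.

```python
import math
import random

def localize(frame, template):
    """Slide the template over the 1-D frame and return the offset with the
    highest normalized correlation (a toy stand-in for the Siamese search)."""
    t_norm = math.sqrt(sum(v * v for v in template))
    best_score, best_pos = -1e9, 0
    for pos in range(len(frame) - len(template) + 1):
        patch = frame[pos:pos + len(template)]
        p_norm = math.sqrt(sum(v * v for v in patch))
        score = sum(a * b for a, b in zip(patch, template)) / (p_norm * t_norm + 1e-8)
        if score > best_score:
            best_score, best_pos = score, pos
    return best_pos

def cycle_loss(frames, start_pos, width, path):
    """Track forward along `path` (a list of frame indices), then verify in
    reverse order back to the first frame; the loss is the squared drift
    between the starting position and the back-tracked position."""
    pos = start_pos
    template = frames[path[0]][start_pos:start_pos + width]
    for idx in path[1:]:                      # forward tracking
        pos = localize(frames[idx], template)
        template = frames[idx][pos:pos + width]
    for idx in reversed(path[:-1]):           # reverse-order verification
        pos = localize(frames[idx], template)
        template = frames[idx][pos:pos + width]
    return (pos - start_pos) ** 2

# Toy video: a bright 4-sample "object" drifting right by 1 sample per frame.
random.seed(0)
frames = []
for t in range(4):
    frame = [random.gauss(0.0, 0.05) for _ in range(32)]
    for i in range(10 + t, 14 + t):
        frame[i] += 1.0
    frames.append(frame)

# Candidate trajectories (dense and frame-skipping); the verification
# strategy keeps the path whose cycle-consistency loss is smallest.
paths = [[0, 1, 2, 3], [0, 2, 3], [0, 1, 3]]
losses = [cycle_loss(frames, 10, 4, p) for p in paths]
best_path = paths[losses.index(min(losses))]
```

In MCCSST the per-frame localization is performed by the learned Siamese/correlation-filter network and the cycle-consistency loss is back-propagated to train it without labels; the hand-written correlation search above merely illustrates the forward-then-reverse verification loop and the selection among multiple paths.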
Authors
程旭
刘丽华
王莹莹
余梓彤
赵国英
CHENG Xu; LIU Li-Hua; WANG Ying-Ying; YU Zi-Tong; ZHAO Guo-Ying (School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044; Engineering Research Center of Digital Forensics, Nanjing University of Information Science and Technology, Nanjing 210044; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu FI-90014, Finland)
Source
《计算机学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2022, Issue 12, pp. 2544-2560 (17 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China (61802058, 61911530397)
China Scholarship Council Program (201908320175)
China Postdoctoral Science Foundation (2019M651650)
Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX22_1220)
Keywords
video surveillance
object tracking
self-supervised learning
cycle-consistency loss
visual attention mechanism