Funding: the National Natural Science Foundation of China, Grant/Award Number: 62006065; the Science and Technology Research Program of Chongqing Municipal Education Commission, Grant/Award Number: KJQN202100634; the Natural Science Foundation of Chongqing, Grant/Award Number: CSTB2022NSCQ-MSX1202.
Abstract: Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target–search feature correlation through self- and/or cross-attention operations, so the model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate the practical deployment of Transformer-based trackers, we propose a dual pooling Transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multi-scale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate the temporal dynamics and spatial appearance information of multi-frame templates along the space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while maintaining a shorter attention-token sequence length, fewer parameters, and fewer FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
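The abstract does not give implementation details of the pooling modules, but the general idea behind pooling attention tokens to curb the quadratic cost can be illustrated with a minimal PyTorch sketch. This is our own illustrative assumption, not the authors' DPT design: keys and values are spatially average-pooled before attention, so the attention matrix is computed over a shorter token sequence.

```python
import torch
import torch.nn as nn

class PooledAttention(nn.Module):
    """Minimal sketch of pooled attention (illustrative, not the DPT modules)."""
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AvgPool2d(pool_stride, pool_stride)  # shortens the K/V token sequence
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens of an h x w feature map, N = h * w
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pool the token map before producing keys/values, giving N' < N tokens.
        x_2d = x.transpose(1, 2).reshape(B, C, h, w)
        x_pooled = self.pool(x_2d).flatten(2).transpose(1, 2)        # (B, N', C)
        kv = self.kv(x_pooled).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                             # each (B, heads, N', head_dim)

        # Attention matrix is (N, N') instead of (N, N), reducing compute and memory.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the keys and values come from the pooled map, the attention cost scales with N * N' rather than N^2, which is the kind of saving that matters when template and search tokens are concatenated into one long sequence.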
Abstract: In this paper we address the problem of tracking human poses across multiple perspective scales in 2D monocular images and videos. In most state-of-the-art 2D tracking approaches, the issue of scale variation is rarely discussed; in reality, however, videos often contain human motion whose scale changes dynamically. We propose a tracking framework that deals with this problem: a scale checking and adjusting algorithm that automatically adjusts the perspective scale during tracking. Two metrics are proposed for detecting and adjusting the scale change. The first metric is the height of the tracked target, which is suitable for sequences in which the target is upright and not stretching its limbs. The second metric is more generic and invariant to motion type: the ratio between the pixel count of the target silhouette and that of the detected bounding box of the target body. The proposed algorithm is tested on the publicly available HumanEva dataset, and the experimental results show that our method achieves higher accuracy and efficiency than state-of-the-art approaches.
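As a rough illustration of the second, motion-invariant metric, the sketch below computes the silhouette-to-bounding-box pixel ratio from a binary foreground mask. The function names, the (x, y, w, h) box convention, and the drift tolerance are our own assumptions for illustration; the abstract does not specify thresholds.

```python
import numpy as np

def silhouette_box_ratio(silhouette_mask: np.ndarray, bbox: tuple) -> float:
    """Ratio of silhouette pixels to bounding-box area.

    silhouette_mask: binary (H, W) foreground mask; bbox: (x, y, w, h).
    """
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        return 0.0
    silhouette_pixels = int(silhouette_mask[y:y + h, x:x + w].sum())
    return silhouette_pixels / float(w * h)

def scale_changed(ratio_curr: float, ratio_ref: float, tol: float = 0.15) -> bool:
    """Flag a scale change when the ratio drifts beyond a (hypothetical) tolerance."""
    return abs(ratio_curr - ratio_ref) > tol * ratio_ref
```

In a tracker, a drift test of this kind could trigger the scale-adjustment step before estimating the pose in the next frame; because the ratio normalises the silhouette area by the box area, it stays comparable across poses and motion types.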