Funding: the National Natural Science Foundation of China, Grant/Award Number: 62006065; the Science and Technology Research Program of Chongqing Municipal Education Commission, Grant/Award Number: KJQN202100634; the Natural Science Foundation of Chongqing, Grant/Award Number: CSTB2022NSCQ-MSX1202.
Abstract: Transformer tracking typically takes paired template and search images as encoder input and conducts feature extraction and target–search feature correlation through self- and/or cross-attention operations, so the model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling Transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multi-scale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate the temporal dynamics and spatial appearance information of multi-frame templates along the space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while maintaining a shorter attention-token sequence length and fewer parameters and FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
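The efficiency argument above rests on shrinking the token sequence that attention operates over. The following minimal NumPy sketch (not the authors' implementation; function names and shapes are illustrative assumptions) shows the core idea of pooling keys and values before attention, which reduces the attention score matrix from M×N to M×(N/pool):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(q, kv, pool=2):
    """Average-pool keys/values along the token axis before attention,
    shrinking the attended sequence from N tokens to N // pool tokens."""
    n, d = kv.shape
    kv_pooled = kv[: (n // pool) * pool].reshape(n // pool, pool, d).mean(axis=1)
    scores = q @ kv_pooled.T / np.sqrt(d)  # (M, N // pool) instead of (M, N)
    return softmax(scores) @ kv_pooled

rng = np.random.default_rng(0)
q = rng.normal(size=(64, 32))    # e.g., search-region query tokens
kv = rng.normal(size=(256, 32))  # e.g., concatenated template + search tokens
out = pooled_attention(q, kv, pool=4)
print(out.shape)  # (64, 32): attention now scores 64x64 instead of 64x256
```

With pool=4, the score matrix has a quarter of the columns, so the attention FLOPs drop proportionally while the output shape is unchanged.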
Funding: National Natural Science Foundation of China, Grant/Award Numbers: 61825305, 62003361, U21A20518; China Postdoctoral Science Foundation, Grant/Award Number: 47680.
Abstract: Although previous studies have made clear leaps in learning latent dynamics from high-dimensional representations, the accuracy and inference time of long-term model prediction still need to be improved. In this study, a deep convolutional network based on the Koopman operator (CKNet) is proposed to model non-linear systems with pixel-level measurements for long-term prediction. CKNet adopts an autoencoder architecture, consisting of an encoder that generates latent states and a linear dynamical model (i.e., the Koopman operator) that evolves in the latent state space spanned by the encoder; the decoder recovers images from latent states. Under a multi-step-ahead prediction loss function, the system matrices approximating the Koopman operator are trained synchronously with the autoencoder in a mini-batch manner. In this way, gradients flow to both the system matrices and the autoencoder, helping the encoder adaptively tune the latent state space during training, and the resulting model is time-invariant in the latent space. Therefore, the proposed CKNet offers short inference time and high accuracy for long-term prediction. Experiments are performed on OpenAI Gym and MuJoCo environments, including two and four non-linear forced dynamical systems with continuous action spaces, respectively. The experimental results show that CKNet has strong long-term prediction capabilities with sufficient precision.
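The abstract's key mechanism, a time-invariant linear model evolving latent states under a multi-step prediction loss, can be sketched in a few lines of NumPy. This is a hypothetical toy illustration, not CKNet itself: the matrices A and B, the latent/action dimensions, and the horizon H are all made-up placeholders standing in for the learned system matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
dz, du, H = 8, 2, 5  # latent dim, action dim, prediction horizon (illustrative)

# Koopman-style linear latent dynamics: z_{t+1} = A @ z_t + B @ u_t.
# In CKNet these matrices are learned jointly with the autoencoder.
A = np.eye(dz) + rng.normal(scale=0.05, size=(dz, dz))
B = rng.normal(scale=0.1, size=(dz, du))

def rollout(z0, actions):
    """Evolve the latent state over len(actions) steps with the
    time-invariant linear model."""
    z, traj = z0, []
    for u in actions:
        z = A @ z + B @ u
        traj.append(z)
    return np.stack(traj)

def multi_step_loss(z0, actions, z_targets):
    """Mean squared error over all H predicted steps -- the signal that
    would train A, B synchronously with the encoder."""
    preds = rollout(z0, actions)
    return float(np.mean((preds - z_targets) ** 2))

z0 = rng.normal(size=dz)
acts = rng.normal(size=(H, du))
targets = rollout(z0, acts)          # pretend these came from encoded frames
print(multi_step_loss(z0, acts, targets))  # 0.0 when dynamics match exactly
```

Because the rollout reuses the same A and B at every step, the model is time-invariant in the latent space, and long-horizon prediction is just repeated matrix multiplication, which is what makes inference fast.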