Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have becom...Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have become a research hotspot due to their ability to globally model and contextualize information.However,current Transformer-based object tracking methods still face challenges such as low tracking accuracy and the presence of redundant feature information.In this paper,we introduce self-calibration multi-head self-attention Transformer(SMSTracker)as a solution to these challenges.It employs a hybrid tensor decomposition self-organizing multihead self-attention transformermechanism,which not only compresses and accelerates Transformer operations but also significantly reduces redundant data,thereby enhancing the accuracy and efficiency of tracking.Additionally,we introduce a self-calibration attention fusion block to resolve common issues of attention ambiguities and inconsistencies found in traditional trackingmethods,ensuring the stability and reliability of tracking performance across various scenarios.By integrating a hybrid tensor decomposition approach with a self-organizingmulti-head self-attentive transformer mechanism,SMSTracker enhances the efficiency and accuracy of the tracking process.Experimental results show that SMSTracker achieves competitive performance in visual object tracking,promising more robust and efficient tracking systems,demonstrating its potential to providemore robust and efficient tracking solutions in real-world applications.展开更多
There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most ...There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most challenging problems.In this paper,we present a visual object tracking and servoing control system utilizing a tailor-made 38 g nano-scale quadrotor.A lightweight visual module is integrated to enable object tracking capabilities,and a micro positioning deck is mounted to provide accurate pose estimation.In order to be robust against object appearance variations,a novel object tracking algorithm,denoted by RMCTer,is proposed,which integrates a powerful short-term tracking module and an efficient long-term processing module.In particular,the long-term processing module can provide additional object information and modify the short-term tracking model in a timely manner.Furthermore,a positionbased visual servoing control method is proposed for the quadrotor,where an adaptive tracking controller is designed by leveraging backstepping and adaptive techniques.Stable and accurate object tracking is achieved even under disturbances.Experimental results are presented to demonstrate the high accuracy and stability of the whole tracking system.展开更多
The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive tech...The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive technique that can cover all the diverse noises and mitigate their damaging effects on the performance and precision of various systems is still missing.In this paper,we have focused on the stability and robustness of one computer vision branch(i.e.,visual object tracking).We have demonstrated that,without imposing a heavy computational load on a model or changing its algorithms,the drop in the performance and accuracy of a system when it is exposed to an unseen noise-laden test dataset can be prevented by simply applying the style transfer technique on the train dataset and training the model with a combination of these and the original untrained data.To verify our proposed approach,it is applied on a generic object tracker by using regression networks.This method’s validity is confirmed by testing it on an exclusive benchmark comprising 50 image sequences,with each sequence containing 15 types of noise at five different intensity levels.The OPE curves obtained show a 40%increase in the robustness of the proposed object tracker against noise,compared to the other trackers considered.展开更多
Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of ...Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of such trackers heavily relies on ViT models pretrained for long periods,limitingmore flexible model designs for tracking tasks.To address this issue,we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders,called TrackMAE.During pretraining,we employ two shared-parameter ViTs,serving as the appearance encoder and motion encoder,respectively.The appearance encoder encodes randomly masked image data,while the motion encoder encodes randomly masked pairs of video frames.Subsequently,an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level.In this way,ViT learns to understand both the appearance of images and the motion between video frames simultaneously.Experimental results demonstrate that ViT-Base and ViT-Large models,pretrained with TrackMAE and combined with a simple tracking head,achieve state-of-the-art(SOTA)performance without additional design.Moreover,compared to the currently popular MAE pretraining methods,TrackMAE consumes only 1/5 of the training time,which will facilitate the customization of diverse models for tracking.For instance,we additionally customize a lightweight ViT-XS,which achieves SOTA efficient tracking performance.展开更多
Visual object tracking (VOT) is an important sub- field of computer vision. It has widespread application do- mains, and has been considered as an important part of surveillance and security system. VOA facilitates ...Visual object tracking (VOT) is an important sub- field of computer vision. It has widespread application do- mains, and has been considered as an important part of surveillance and security system. VOA facilitates finding the position of target in image coordinates of video frames. While doing this, VOA also faces many challenges such as noise, clutter, occlusion, rapid change in object appearances, highly maneuvered (complex) object motion, illumination changes. In recent years, VOT has made significant progress due to availability of low-cost high-quality video cameras as well as fast computational resources, and many modern techniques have been proposed to handle the challenges faced by VOT. This article introduces the readers to 1) VOT and its applica- tions in other domains, 2) different issues which arise in it, 3) various classical as well as contemporary approaches for object tracking, 4) evaluation methodologies for VOT, and 5) online resources, i.e., annotated datasets and source code available for various tracking techniques.展开更多
In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multip...In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multiple independently-steerable zooming cameras to effectively monitor a set of targets of interest.Each camera is dedicated to tracking a specific target or cluster of targets.The key innovation of this study,in comparison to existing approaches,lies in incorporating the zooming factor for the onboard cameras into the optimization problem.This enhancement offers greater flexibility during mission execution by allowing the autonomous agent to adjust the focal lengths of the onboard cameras,in exchange for varying real-world distances to the corresponding targets,thereby providing additional degrees of freedom to the optimization problem.The proposed optimization framework aims to strike a balance among various factors,including distance to the targets,verticality of viewpoints,and the required focal length for each camera.The primary focus of this paper is to establish the theoretical groundwork for addressing the non-convex nature of the optimization problem arising from these considerations.To this end,we develop an original convex approximation strategy.The paper also includes simulations of diverse scenarios,featuring varying numbers of onboard tracking cameras and target motion profiles,to validate the effectiveness of the proposed approach.展开更多
As a fundamental task in computer vision,visual object tracking has received much attention in recent years.Most studies focus on short-term visual tracking which addresses shorter videos and always-visible targets.Ho...As a fundamental task in computer vision,visual object tracking has received much attention in recent years.Most studies focus on short-term visual tracking which addresses shorter videos and always-visible targets.However,long-term visual tracking is much closer to practical applications with more complicated challenges.There exists a longer duration such as minute-level or even hour-level in the long-term tracking task,and the task also needs to handle more frequent target disappearance and reappearance.In this paper,we provide a thorough review of long-term tracking,summarizing long-term tracking algorithms from two perspectives:framework architectures and utilization of intermediate tracking results.Then we provide a detailed description of existing benchmarks and corresponding evaluation protocols.Furthermore,we conduct extensive experiments and analyse the performance of trackers on six benchmarks:VOTLT2018,VOTLT2019(2020/2021),OxUvA,LaSOT,TLP and the long-term subset of VTUAV-V.Finally,we discuss the future prospects from multiple perspectives,including algorithm design and benchmark construction.To our knowledge,this is the first comprehensive survey for long-term visual object tracking.The relevant content is available at https://github.com/wangdongdut/Long-term-Visual-Tracking.展开更多
Teleoperation is of great importance in the area of robotics,especially when people are unavailable in the robot workshop.It provides a way for people to control robots remotely using human intelligence.In this paper,...Teleoperation is of great importance in the area of robotics,especially when people are unavailable in the robot workshop.It provides a way for people to control robots remotely using human intelligence.In this paper,a robotic teleoperation system for precise robotic manipulation is established.The data glove and the 7-degrees of freedom(DOFs)force feedback controller are used for the remote control interaction.The control system and the monitor system are designed for the remote precise manipulation.The monitor system contains an image acquisition system and a human-machine interaction module,and aims to simulate and detect the robot running state.Besides,a visual object tracking algorithm is developed to estimate the states of the dynamic system from noisy observations.The established robotic teleoperation systemis applied to a series of experiments,and high-precision results are obtained,showing the effectiveness of the physical system.展开更多
基金supported by the National Natural Science Foundation of China under Grant 62177029the Postgraduate Research&Practice Innovation Program of Jiangsu Province(KYCX21_0740),China.
文摘Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have become a research hotspot due to their ability to globally model and contextualize information.However,current Transformer-based object tracking methods still face challenges such as low tracking accuracy and the presence of redundant feature information.In this paper,we introduce self-calibration multi-head self-attention Transformer(SMSTracker)as a solution to these challenges.It employs a hybrid tensor decomposition self-organizing multihead self-attention transformermechanism,which not only compresses and accelerates Transformer operations but also significantly reduces redundant data,thereby enhancing the accuracy and efficiency of tracking.Additionally,we introduce a self-calibration attention fusion block to resolve common issues of attention ambiguities and inconsistencies found in traditional trackingmethods,ensuring the stability and reliability of tracking performance across various scenarios.By integrating a hybrid tensor decomposition approach with a self-organizingmulti-head self-attentive transformer mechanism,SMSTracker enhances the efficiency and accuracy of the tracking process.Experimental results show that SMSTracker achieves competitive performance in visual object tracking,promising more robust and efficient tracking systems,demonstrating its potential to providemore robust and efficient tracking solutions in real-world applications.
基金supported in part by the Institute for Guo Qiang of Tsinghua University(2019GQG1023)in part by Graduate Education and Teaching Reform Project of Tsinghua University(202007J007)+1 种基金in part by National Natural Science Foundation of China(U19B2029,62073028,61803222)in part by the Independent Research Program of Tsinghua University(2018Z05JDX002)。
文摘There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most challenging problems.In this paper,we present a visual object tracking and servoing control system utilizing a tailor-made 38 g nano-scale quadrotor.A lightweight visual module is integrated to enable object tracking capabilities,and a micro positioning deck is mounted to provide accurate pose estimation.In order to be robust against object appearance variations,a novel object tracking algorithm,denoted by RMCTer,is proposed,which integrates a powerful short-term tracking module and an efficient long-term processing module.In particular,the long-term processing module can provide additional object information and modify the short-term tracking model in a timely manner.Furthermore,a positionbased visual servoing control method is proposed for the quadrotor,where an adaptive tracking controller is designed by leveraging backstepping and adaptive techniques.Stable and accurate object tracking is achieved even under disturbances.Experimental results are presented to demonstrate the high accuracy and stability of the whole tracking system.
文摘The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive technique that can cover all the diverse noises and mitigate their damaging effects on the performance and precision of various systems is still missing.In this paper,we have focused on the stability and robustness of one computer vision branch(i.e.,visual object tracking).We have demonstrated that,without imposing a heavy computational load on a model or changing its algorithms,the drop in the performance and accuracy of a system when it is exposed to an unseen noise-laden test dataset can be prevented by simply applying the style transfer technique on the train dataset and training the model with a combination of these and the original untrained data.To verify our proposed approach,it is applied on a generic object tracker by using regression networks.This method’s validity is confirmed by testing it on an exclusive benchmark comprising 50 image sequences,with each sequence containing 15 types of noise at five different intensity levels.The OPE curves obtained show a 40%increase in the robustness of the proposed object tracker against noise,compared to the other trackers considered.
基金supported in part by National Natural Science Foundation of China(No.62176041)in part by Excellent Science and Technique Talent Foundation of Dalian(No.2022RY21).
文摘Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of such trackers heavily relies on ViT models pretrained for long periods,limitingmore flexible model designs for tracking tasks.To address this issue,we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders,called TrackMAE.During pretraining,we employ two shared-parameter ViTs,serving as the appearance encoder and motion encoder,respectively.The appearance encoder encodes randomly masked image data,while the motion encoder encodes randomly masked pairs of video frames.Subsequently,an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level.In this way,ViT learns to understand both the appearance of images and the motion between video frames simultaneously.Experimental results demonstrate that ViT-Base and ViT-Large models,pretrained with TrackMAE and combined with a simple tracking head,achieve state-of-the-art(SOTA)performance without additional design.Moreover,compared to the currently popular MAE pretraining methods,TrackMAE consumes only 1/5 of the training time,which will facilitate the customization of diverse models for tracking.For instance,we additionally customize a lightweight ViT-XS,which achieves SOTA efficient tracking performance.
文摘Visual object tracking (VOT) is an important sub- field of computer vision. It has widespread application do- mains, and has been considered as an important part of surveillance and security system. VOA facilitates finding the position of target in image coordinates of video frames. While doing this, VOA also faces many challenges such as noise, clutter, occlusion, rapid change in object appearances, highly maneuvered (complex) object motion, illumination changes. In recent years, VOT has made significant progress due to availability of low-cost high-quality video cameras as well as fast computational resources, and many modern techniques have been proposed to handle the challenges faced by VOT. This article introduces the readers to 1) VOT and its applica- tions in other domains, 2) different issues which arise in it, 3) various classical as well as contemporary approaches for object tracking, 4) evaluation methodologies for VOT, and 5) online resources, i.e., annotated datasets and source code available for various tracking techniques.
基金supported by grants PID2022-142946NA-I00 and PID2022-141159OB-I00funded by MICIU/AEI/10.13039/501100011033ERDF/EU
文摘In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multiple independently-steerable zooming cameras to effectively monitor a set of targets of interest.Each camera is dedicated to tracking a specific target or cluster of targets.The key innovation of this study,in comparison to existing approaches,lies in incorporating the zooming factor for the onboard cameras into the optimization problem.This enhancement offers greater flexibility during mission execution by allowing the autonomous agent to adjust the focal lengths of the onboard cameras,in exchange for varying real-world distances to the corresponding targets,thereby providing additional degrees of freedom to the optimization problem.The proposed optimization framework aims to strike a balance among various factors,including distance to the targets,verticality of viewpoints,and the required focal length for each camera.The primary focus of this paper is to establish the theoretical groundwork for addressing the non-convex nature of the optimization problem arising from these considerations.To this end,we develop an original convex approximation strategy.The paper also includes simulations of diverse scenarios,featuring varying numbers of onboard tracking cameras and target motion profiles,to validate the effectiveness of the proposed approach.
基金supported by National Natural Science Foundation of China(Nos.62176041 and 62022021)Joint Fund of Ministry of Education for Equipment Preresearch,China(No.8091B032155)+1 种基金the Science and Technology Innovation Foundation of Dalian,China(No.2020 JJ26GX036)the Fundamental Research Funds for the Central Universities,China(No.DUT21LAB127).
文摘As a fundamental task in computer vision,visual object tracking has received much attention in recent years.Most studies focus on short-term visual tracking which addresses shorter videos and always-visible targets.However,long-term visual tracking is much closer to practical applications with more complicated challenges.There exists a longer duration such as minute-level or even hour-level in the long-term tracking task,and the task also needs to handle more frequent target disappearance and reappearance.In this paper,we provide a thorough review of long-term tracking,summarizing long-term tracking algorithms from two perspectives:framework architectures and utilization of intermediate tracking results.Then we provide a detailed description of existing benchmarks and corresponding evaluation protocols.Furthermore,we conduct extensive experiments and analyse the performance of trackers on six benchmarks:VOTLT2018,VOTLT2019(2020/2021),OxUvA,LaSOT,TLP and the long-term subset of VTUAV-V.Finally,we discuss the future prospects from multiple perspectives,including algorithm design and benchmark construction.To our knowledge,this is the first comprehensive survey for long-term visual object tracking.The relevant content is available at https://github.com/wangdongdut/Long-term-Visual-Tracking.
基金NSFC-Shenzhen Robotics Research Center Project(No.U2013207)the Beijing Science and Technology Plan Project(No.Z191100008019008)。
文摘Teleoperation is of great importance in the area of robotics,especially when people are unavailable in the robot workshop.It provides a way for people to control robots remotely using human intelligence.In this paper,a robotic teleoperation system for precise robotic manipulation is established.The data glove and the 7-degrees of freedom(DOFs)force feedback controller are used for the remote control interaction.The control system and the monitor system are designed for the remote precise manipulation.The monitor system contains an image acquisition system and a human-machine interaction module,and aims to simulate and detect the robot running state.Besides,a visual object tracking algorithm is developed to estimate the states of the dynamic system from noisy observations.The established robotic teleoperation systemis applied to a series of experiments,and high-precision results are obtained,showing the effectiveness of the physical system.