There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most ...There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most challenging problems.In this paper,we present a visual object tracking and servoing control system utilizing a tailor-made 38 g nano-scale quadrotor.A lightweight visual module is integrated to enable object tracking capabilities,and a micro positioning deck is mounted to provide accurate pose estimation.In order to be robust against object appearance variations,a novel object tracking algorithm,denoted by RMCTer,is proposed,which integrates a powerful short-term tracking module and an efficient long-term processing module.In particular,the long-term processing module can provide additional object information and modify the short-term tracking model in a timely manner.Furthermore,a positionbased visual servoing control method is proposed for the quadrotor,where an adaptive tracking controller is designed by leveraging backstepping and adaptive techniques.Stable and accurate object tracking is achieved even under disturbances.Experimental results are presented to demonstrate the high accuracy and stability of the whole tracking system.展开更多
Inspired by human behaviors, a robot object tracking model is proposed on the basis of visual attention mechanism, which is fit for the theory of topological perception. The model integrates the image-driven, bottom-u...Inspired by human behaviors, a robot object tracking model is proposed on the basis of visual attention mechanism, which is fit for the theory of topological perception. The model integrates the image-driven, bottom-up attention and the object-driven, top-down attention, whereas the previous attention model has mostly focused on either the bottom-up or top-down attention. By the bottom-up component, the whole scene is segmented into the ground region and the salient regions. Guided by top-down strategy which is achieved by a topological graph, the object regions are separated from the salient regions. The salient regions except the object regions are the barrier regions. In order to estimate the model, a mobile robot platform is developed, on which some experiments are implemented. The experimental results indicate that processing an image with a resolution of 752 × 480 pixels takes less than 200 ms and the object regions are unabridged. The analysis obtained by comparing the proposed model with the existing model demonstrates that the proposed model has some advantages in robot object tracking in terms of speed and efficiency.展开更多
An object model-based tracking method is useful for tracking multiple objects, but the main difficulties are modeling objects reliably and tracking objects via models in successive frames. An effective tracking method...An object model-based tracking method is useful for tracking multiple objects, but the main difficulties are modeling objects reliably and tracking objects via models in successive frames. An effective tracking method using the object models is proposed to track multiple objects in a real-time visual surveillance system. Firstly, for detecting objects, an adaptive kernel density estimation method is utilized, which uses an adaptive bandwidth and features combining colour and gradient. Secondly, some models of objects are built for describing motion, shape and colour features. Then, a matching matrix is formed to analyze tracking situations. If objects are tracked under occlusions, the optimal "visual" object is found to represent the occluded object, and the posterior probability of pixel is used to determine which pixel is utilized for updating object models. Extensive experiments show that this method improves the accuracy and validity of tracking objects even under occlusions and is used in real-time visual surveillance systems.展开更多
The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive tech...The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive technique that can cover all the diverse noises and mitigate their damaging effects on the performance and precision of various systems is still missing.In this paper,we have focused on the stability and robustness of one computer vision branch(i.e.,visual object tracking).We have demonstrated that,without imposing a heavy computational load on a model or changing its algorithms,the drop in the performance and accuracy of a system when it is exposed to an unseen noise-laden test dataset can be prevented by simply applying the style transfer technique on the train dataset and training the model with a combination of these and the original untrained data.To verify our proposed approach,it is applied on a generic object tracker by using regression networks.This method’s validity is confirmed by testing it on an exclusive benchmark comprising 50 image sequences,with each sequence containing 15 types of noise at five different intensity levels.The OPE curves obtained show a 40%increase in the robustness of the proposed object tracker against noise,compared to the other trackers considered.展开更多
The 3D object visual tracking problem is studied for the robot vision system of the 220kV/330kV high-voltage live-line insulator cleaning robot. The SUSAN Edge based Scale Invariant Feature (SESIF) algorithm based 3D ...The 3D object visual tracking problem is studied for the robot vision system of the 220kV/330kV high-voltage live-line insulator cleaning robot. The SUSAN Edge based Scale Invariant Feature (SESIF) algorithm based 3D objects visual tracking is achieved in three stages: the first frame stage,tracking stage,and recovering stage. An SESIF based objects recognition algorithm is proposed to find initial location at both the first frame stage and recovering stage. An SESIF and Lie group based visual tracking algorithm is used to track 3D object. Experiments verify the algorithm's robustness. This algorithm will be used in the second generation of the 220kV/330kV high-voltage live-line insulator cleaning robot.展开更多
In this paper, we have presented a novel tracking method aiming at detecting objects and maintaining their la-bel/identification over the time. The key factors of this method are to use depth information and different...In this paper, we have presented a novel tracking method aiming at detecting objects and maintaining their la-bel/identification over the time. The key factors of this method are to use depth information and different strategies to track objects under various occlusion scenarios. The foreground objects are detected and refined by background subtraction and shadow cancellation. The occlusion detection is based on information of foreground blobs in successive frames. The occlusion regions are projected to the projection plane XZ to analysis occlusion situation. According to the occlusion analysis results, different objects’ corresponding strategies are introduced to track objects under various occlusion scenarios including tracking occluded objects in similar depth layer and in different depth layers. The experimental results show that our proposed method can track the moving objects under the most typical and challenging occlusion scenarios.展开更多
Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have becom...Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have become a research hotspot due to their ability to globally model and contextualize information.However,current Transformer-based object tracking methods still face challenges such as low tracking accuracy and the presence of redundant feature information.In this paper,we introduce self-calibration multi-head self-attention Transformer(SMSTracker)as a solution to these challenges.It employs a hybrid tensor decomposition self-organizing multihead self-attention transformermechanism,which not only compresses and accelerates Transformer operations but also significantly reduces redundant data,thereby enhancing the accuracy and efficiency of tracking.Additionally,we introduce a self-calibration attention fusion block to resolve common issues of attention ambiguities and inconsistencies found in traditional trackingmethods,ensuring the stability and reliability of tracking performance across various scenarios.By integrating a hybrid tensor decomposition approach with a self-organizingmulti-head self-attentive transformer mechanism,SMSTracker enhances the efficiency and accuracy of the tracking process.Experimental results show that SMSTracker achieves competitive performance in visual object tracking,promising more robust and efficient tracking systems,demonstrating its potential to providemore robust and efficient tracking solutions in real-world applications.展开更多
Visual object tracking has been drawing increasing attention in recent years,as a fundamental task in computer vision.To extend the range of tracking applications,researchers have been introducing information from mul...Visual object tracking has been drawing increasing attention in recent years,as a fundamental task in computer vision.To extend the range of tracking applications,researchers have been introducing information from multiple modalities to handle specific scenes,with promising research prospects for emerging methods and benchmarks.To provide a thorough review of multi-modal tracking,different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy,with specific focus on visibledepth(RGB-D)and visible-thermal(RGB-T)tracking.Subsequently,a detailed description of the related benchmarks and challenges is provided.Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets:PTB,VOT19-RGBD,GTOT,RGBT234,and VOT19-RGBT.Finally,various future directions,including model design and dataset construction,are discussed from different perspectives for further research.展开更多
In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multip...In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multiple independently-steerable zooming cameras to effectively monitor a set of targets of interest.Each camera is dedicated to tracking a specific target or cluster of targets.The key innovation of this study,in comparison to existing approaches,lies in incorporating the zooming factor for the onboard cameras into the optimization problem.This enhancement offers greater flexibility during mission execution by allowing the autonomous agent to adjust the focal lengths of the onboard cameras,in exchange for varying real-world distances to the corresponding targets,thereby providing additional degrees of freedom to the optimization problem.The proposed optimization framework aims to strike a balance among various factors,including distance to the targets,verticality of viewpoints,and the required focal length for each camera.The primary focus of this paper is to establish the theoretical groundwork for addressing the non-convex nature of the optimization problem arising from these considerations.To this end,we develop an original convex approximation strategy.The paper also includes simulations of diverse scenarios,featuring varying numbers of onboard tracking cameras and target motion profiles,to validate the effectiveness of the proposed approach.展开更多
Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of ...Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of such trackers heavily relies on ViT models pretrained for long periods,limitingmore flexible model designs for tracking tasks.To address this issue,we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders,called TrackMAE.During pretraining,we employ two shared-parameter ViTs,serving as the appearance encoder and motion encoder,respectively.The appearance encoder encodes randomly masked image data,while the motion encoder encodes randomly masked pairs of video frames.Subsequently,an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level.In this way,ViT learns to understand both the appearance of images and the motion between video frames simultaneously.Experimental results demonstrate that ViT-Base and ViT-Large models,pretrained with TrackMAE and combined with a simple tracking head,achieve state-of-the-art(SOTA)performance without additional design.Moreover,compared to the currently popular MAE pretraining methods,TrackMAE consumes only 1/5 of the training time,which will facilitate the customization of diverse models for tracking.For instance,we additionally customize a lightweight ViT-XS,which achieves SOTA efficient tracking performance.展开更多
基金supported in part by the Institute for Guo Qiang of Tsinghua University(2019GQG1023)in part by Graduate Education and Teaching Reform Project of Tsinghua University(202007J007)+1 种基金in part by National Natural Science Foundation of China(U19B2029,62073028,61803222)in part by the Independent Research Program of Tsinghua University(2018Z05JDX002)。
文摘There are two main trends in the development of unmanned aerial vehicle(UAV)technologies:miniaturization and intellectualization,in which realizing object tracking capabilities for a nano-scale UAV is one of the most challenging problems.In this paper,we present a visual object tracking and servoing control system utilizing a tailor-made 38 g nano-scale quadrotor.A lightweight visual module is integrated to enable object tracking capabilities,and a micro positioning deck is mounted to provide accurate pose estimation.In order to be robust against object appearance variations,a novel object tracking algorithm,denoted by RMCTer,is proposed,which integrates a powerful short-term tracking module and an efficient long-term processing module.In particular,the long-term processing module can provide additional object information and modify the short-term tracking model in a timely manner.Furthermore,a positionbased visual servoing control method is proposed for the quadrotor,where an adaptive tracking controller is designed by leveraging backstepping and adaptive techniques.Stable and accurate object tracking is achieved even under disturbances.Experimental results are presented to demonstrate the high accuracy and stability of the whole tracking system.
基金supported by National Basic Research Program of China (973 Program) (No. 2006CB300407)National Natural Science Foundation of China (No. 50775017)
文摘Inspired by human behaviors, a robot object tracking model is proposed on the basis of visual attention mechanism, which is fit for the theory of topological perception. The model integrates the image-driven, bottom-up attention and the object-driven, top-down attention, whereas the previous attention model has mostly focused on either the bottom-up or top-down attention. By the bottom-up component, the whole scene is segmented into the ground region and the salient regions. Guided by top-down strategy which is achieved by a topological graph, the object regions are separated from the salient regions. The salient regions except the object regions are the barrier regions. In order to estimate the model, a mobile robot platform is developed, on which some experiments are implemented. The experimental results indicate that processing an image with a resolution of 752 × 480 pixels takes less than 200 ms and the object regions are unabridged. The analysis obtained by comparing the proposed model with the existing model demonstrates that the proposed model has some advantages in robot object tracking in terms of speed and efficiency.
基金supported by the National Natural Science Foundation of China(60835004 60775047+2 种基金 60872130)the National High Technology Research and Development Program of China(863 Program)(2007AA04Z244 2008AA04Z214)
文摘An object model-based tracking method is useful for tracking multiple objects, but the main difficulties are modeling objects reliably and tracking objects via models in successive frames. An effective tracking method using the object models is proposed to track multiple objects in a real-time visual surveillance system. Firstly, for detecting objects, an adaptive kernel density estimation method is utilized, which uses an adaptive bandwidth and features combining colour and gradient. Secondly, some models of objects are built for describing motion, shape and colour features. Then, a matching matrix is formed to analyze tracking situations. If objects are tracked under occlusions, the optimal "visual" object is found to represent the occluded object, and the posterior probability of pixel is used to determine which pixel is utilized for updating object models. Extensive experiments show that this method improves the accuracy and validity of tracking objects even under occlusions and is used in real-time visual surveillance systems.
文摘The performance and accuracy of computer vision systems are affected by noise in different forms.Although numerous solutions and algorithms have been presented for dealing with every type of noise,a comprehensive technique that can cover all the diverse noises and mitigate their damaging effects on the performance and precision of various systems is still missing.In this paper,we have focused on the stability and robustness of one computer vision branch(i.e.,visual object tracking).We have demonstrated that,without imposing a heavy computational load on a model or changing its algorithms,the drop in the performance and accuracy of a system when it is exposed to an unseen noise-laden test dataset can be prevented by simply applying the style transfer technique on the train dataset and training the model with a combination of these and the original untrained data.To verify our proposed approach,it is applied on a generic object tracker by using regression networks.This method’s validity is confirmed by testing it on an exclusive benchmark comprising 50 image sequences,with each sequence containing 15 types of noise at five different intensity levels.The OPE curves obtained show a 40%increase in the robustness of the proposed object tracker against noise,compared to the other trackers considered.
基金National High Technology Research and Development Programof China (863program,No.2002AA42D110-2)
文摘The 3D object visual tracking problem is studied for the robot vision system of the 220kV/330kV high-voltage live-line insulator cleaning robot. The SUSAN Edge based Scale Invariant Feature (SESIF) algorithm based 3D objects visual tracking is achieved in three stages: the first frame stage,tracking stage,and recovering stage. An SESIF based objects recognition algorithm is proposed to find initial location at both the first frame stage and recovering stage. An SESIF and Lie group based visual tracking algorithm is used to track 3D object. Experiments verify the algorithm's robustness. This algorithm will be used in the second generation of the 220kV/330kV high-voltage live-line insulator cleaning robot.
文摘In this paper, we have presented a novel tracking method aiming at detecting objects and maintaining their la-bel/identification over the time. The key factors of this method are to use depth information and different strategies to track objects under various occlusion scenarios. The foreground objects are detected and refined by background subtraction and shadow cancellation. The occlusion detection is based on information of foreground blobs in successive frames. The occlusion regions are projected to the projection plane XZ to analysis occlusion situation. According to the occlusion analysis results, different objects’ corresponding strategies are introduced to track objects under various occlusion scenarios including tracking occluded objects in similar depth layer and in different depth layers. The experimental results show that our proposed method can track the moving objects under the most typical and challenging occlusion scenarios.
基金supported by the National Natural Science Foundation of China under Grant 62177029the Postgraduate Research&Practice Innovation Program of Jiangsu Province(KYCX21_0740),China.
文摘Visual object tracking plays a crucial role in computer vision.In recent years,researchers have proposed various methods to achieve high-performance object tracking.Among these,methods based on Transformers have become a research hotspot due to their ability to globally model and contextualize information.However,current Transformer-based object tracking methods still face challenges such as low tracking accuracy and the presence of redundant feature information.In this paper,we introduce self-calibration multi-head self-attention Transformer(SMSTracker)as a solution to these challenges.It employs a hybrid tensor decomposition self-organizing multihead self-attention transformermechanism,which not only compresses and accelerates Transformer operations but also significantly reduces redundant data,thereby enhancing the accuracy and efficiency of tracking.Additionally,we introduce a self-calibration attention fusion block to resolve common issues of attention ambiguities and inconsistencies found in traditional trackingmethods,ensuring the stability and reliability of tracking performance across various scenarios.By integrating a hybrid tensor decomposition approach with a self-organizingmulti-head self-attentive transformer mechanism,SMSTracker enhances the efficiency and accuracy of the tracking process.Experimental results show that SMSTracker achieves competitive performance in visual object tracking,promising more robust and efficient tracking systems,demonstrating its potential to providemore robust and efficient tracking solutions in real-world applications.
基金supported in part by National Natural Science Foundation of China(Nos.U23A20384 and 62022021)in part by Joint Fund of Ministry of Education for Equipment Pre-research(No.8091B032155)+1 种基金in part by the National Defense Basic Scientific Research Program(No.WDZC20215250205)in part by Central Guidance on Local Science and Technology Development Fund of Liaoning Province(No.2022JH6/100100026).
文摘Visual object tracking has been drawing increasing attention in recent years,as a fundamental task in computer vision.To extend the range of tracking applications,researchers have been introducing information from multiple modalities to handle specific scenes,with promising research prospects for emerging methods and benchmarks.To provide a thorough review of multi-modal tracking,different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy,with specific focus on visibledepth(RGB-D)and visible-thermal(RGB-T)tracking.Subsequently,a detailed description of the related benchmarks and challenges is provided.Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets:PTB,VOT19-RGBD,GTOT,RGBT234,and VOT19-RGBT.Finally,various future directions,including model design and dataset construction,are discussed from different perspectives for further research.
基金supported by grants PID2022-142946NA-I00 and PID2022-141159OB-I00,funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU.Recommended by Associate Editor Xin Luo.
文摘In the context of multiple-target tracking and surveillance applications,this paper investigates the challenge of determining the optimal positioning of a single autonomous aerial vehicle or agent equipped with multiple independently-steerable zooming cameras to effectively monitor a set of targets of interest.Each camera is dedicated to tracking a specific target or cluster of targets.The key innovation of this study,in comparison to existing approaches,lies in incorporating the zooming factor for the onboard cameras into the optimization problem.This enhancement offers greater flexibility during mission execution by allowing the autonomous agent to adjust the focal lengths of the onboard cameras,in exchange for varying real-world distances to the corresponding targets,thereby providing additional degrees of freedom to the optimization problem.The proposed optimization framework aims to strike a balance among various factors,including distance to the targets,verticality of viewpoints,and the required focal length for each camera.The primary focus of this paper is to establish the theoretical groundwork for addressing the non-convex nature of the optimization problem arising from these considerations.To this end,we develop an original convex approximation strategy.The paper also includes simulations of diverse scenarios,featuring varying numbers of onboard tracking cameras and target motion profiles,to validate the effectiveness of the proposed approach.
基金supported in part by National Natural Science Foundation of China(No.62176041)in part by Excellent Science and Technique Talent Foundation of Dalian(No.2022RY21).
文摘Significant advancements have beenwitnessed in visual tracking applications leveragingViT in recent years,mainly due to the formidablemodeling capabilities of Vision Transformer(ViT).However,the strong performance of such trackers heavily relies on ViT models pretrained for long periods,limitingmore flexible model designs for tracking tasks.To address this issue,we propose an efficient unsupervised ViT pretraining method for the tracking task based on masked autoencoders,called TrackMAE.During pretraining,we employ two shared-parameter ViTs,serving as the appearance encoder and motion encoder,respectively.The appearance encoder encodes randomly masked image data,while the motion encoder encodes randomly masked pairs of video frames.Subsequently,an appearance decoder and a motion decoder separately reconstruct the original image data and video frame data at the pixel level.In this way,ViT learns to understand both the appearance of images and the motion between video frames simultaneously.Experimental results demonstrate that ViT-Base and ViT-Large models,pretrained with TrackMAE and combined with a simple tracking head,achieve state-of-the-art(SOTA)performance without additional design.Moreover,compared to the currently popular MAE pretraining methods,TrackMAE consumes only 1/5 of the training time,which will facilitate the customization of diverse models for tracking.For instance,we additionally customize a lightweight ViT-XS,which achieves SOTA efficient tracking performance.