Abstract
As neural network research has advanced, object detection has developed rapidly in recent years, and video object detection has gradually attracted scholarly attention, especially frameworks that combine multiple object tracking and detection. Most current work builds the joint tracking and detection paradigm through multi-task learning. Unlike those approaches, this paper proposes a multi-level temporal feature fusion structure that improves the framework's performance by exploiting the constraint of video temporal consistency. To train the temporal network end-to-end, a feature exchange training strategy is introduced so that the temporal feature fusion structure can be trained efficiently. The proposed method is evaluated on several widely recognized benchmarks and achieves encouraging results compared with a well-known joint detection and tracking framework. An ablation experiment examines where temporal feature fusion is best positioned.
Funding
Supported by the Zhejiang Provincial Natural Science Foundation of China (No. LQ24F020016), the General Scientific Research Fund of Zhejiang Provincial Education Department (No. Y202248776), the Wenzhou Science and Technology Plan Project (No. G20220035), and the Wenzhou Key Laboratory Construction Project (No. 2022HZSY0048).