Abstract
Objective For visual object tracking (VOT) and video object segmentation (VOS), researchers have proposed a number of multitask frameworks, but their accuracy and robustness are limited. To address this problem, this paper proposes a real-time, end-to-end multitask framework for visual object tracking and video object segmentation that fuses multi-scale context information with video interframe information. Method The proposed architecture uses an atrous spatial pyramid pooling module that is built from atrous depthwise separable convolutions and covers a wider range of scales, together with an interframe mask propagation module that exploits information between frames, so the network segments multi-scale target objects more accurately while being more robust. Result On the VOT-2016 and VOT-2018 visual object tracking datasets, the expected average overlap (EAO) of the proposed method reaches 0.462 and 0.408, respectively, 0.029 and 0.028 higher than SiamMask, which is a state-of-the-art result with better robustness. Competitive results are also achieved on the DAVIS (densely annotated video segmentation)-2016 and DAVIS-2017 video object segmentation datasets. On the multi-object DAVIS-2017 dataset, the proposed method outperforms SiamMask: the mean Jaccard coefficient of region similarity J_(M) and the mean F-measure of contour accuracy F_(M) reach 56.0 and 59.0, respectively, and the region and contour decay values J_(D) and F_(D), at 17.9 and 19.8, are both lower than those of SiamMask. The method runs at 45 frames per second, which is real-time. Conclusion The proposed real-time, end-to-end multitask framework for visual object tracking and video object segmentation fully captures multi-scale context information and exploits the information between video frames, giving the network a stronger ability to segment multi-scale target objects together with better robustness.
Objective Visual object tracking (VOT) is widely used in applications such as car navigation, automatic video surveillance, and human-computer interaction. It is a basic research task in video analysis: given the position of an object of interest in the first frame of a video, its position must be estimated in all subsequent frames as accurately as possible, which requires inferring the correspondence between the target and each frame. Similarly, semi-supervised video object segmentation (VOS) requires segmenting the target object in all subsequent frames given the mask of the initial frame, and it is likewise a basic task of computer vision. However, the target object may undergo large changes in pose, scale, and appearance over the video sequence and may encounter difficult conditions such as occlusion, rapid motion, and truncation. Performing robust VOT and VOS in a semi-supervised manner is therefore challenging. At the same time, the continuous nature of a video sequence provides additional contextual information: interframe consistency allows a network to transfer information effectively from frame to frame, and in VOS the information from previous frames can be regarded as temporal context that provides useful hints for subsequent predictions. Making effective use of this additional information is therefore extremely important for video tasks. Researchers have proposed various multitask frameworks for VOT and VOS, but their accuracy and robustness are poor. To address these problems, this paper proposes a multitask end-to-end framework for real-time VOT and VOS that combines multi-scale context information with video interframe information.

Method In this work, the standard depthwise convolution is replaced with an atrous depthwise convolution, forming an atrous depthwise separable convolution. By choosing different atrous rates, the convolution obtains different receptive fields while remaining lightweight. On this basis, an atrous spatial pyramid pooling (ASPP) module with a large set of atrous rates, composed of atrous depthwise separable convolutions, is designed and applied to the VOS branch so that the network can capture multi-scale context. Atrous rates of 1, 3, 6, 9, 12, 24, 36, and 48 are used to convolve the feature map with different receptive fields, and adaptive pooling is applied to the feature map as an additional branch. The resulting feature maps are concatenated, and a 1 × 1 convolution is used to adjust the number of channels, so the output feature map carries rich multi-scale context information and the network can predict targets at multiple scales. In addition, as noted above, the continuity of video brings extra temporal context, because the information from previous frames provides useful hints for subsequent predictions. Inspired by the reference-guided mask propagation algorithm, a mask propagation module is therefore added to the VOS branch to provide location and segmentation information to the network.
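A minimal PyTorch-style sketch of the multi-rate ASPP module described above follows. The 3 × 3 atrous depthwise separable convolutions and the atrous rates 1, 3, 6, 9, 12, 24, 36, and 48 come from the text; the channel widths, batch normalization, and ReLU placement are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtrousSeparableConv(nn.Module):
    """3x3 atrous depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=rate,
                                   dilation=rate, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))

class MultiScaleASPP(nn.Module):
    """Parallel atrous branches (rates 1, 3, 6, 9, 12, 24, 36, 48) plus an
    adaptive-pooling branch, concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch=256, branch_ch=64,
                 rates=(1, 3, 6, 9, 12, 24, 36, 48)):
        super().__init__()
        self.branches = nn.ModuleList(
            [AtrousSeparableConv(in_ch, branch_ch, r) for r in rates])
        self.pool_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.ReLU(inplace=True))
        self.project = nn.Conv2d(branch_ch * (len(rates) + 1), in_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))
```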
The proposed mask propagation module is composed of 3 × 3 convolutions with atrous rates of 2, 3, and 6. In our architecture, the multi-scale atrous spatial pyramid pooling module built from atrous depthwise separable convolutions and the interframe mask propagation module that carries interframe information together give the network a strong ability to segment multi-scale target objects and better robustness.

Result All experiments are performed on NVIDIA TITAN X graphics cards. The network is trained in two stages, and different training sets are used in each stage because of their different purposes. In the first stage, the YouTube-VOS, Common Objects in Context (COCO), ImageNet-DET (DETection), and ImageNet-VID (VIDeo) datasets are used. For datasets without mask ground truth, the mask branch is not trained; for samples that consist of a single image rather than a video sequence, the previous-frame image and mask fed to the interframe mask propagation module are set equal to the current frame. Inspired by SiamMask, the stochastic gradient descent (SGD) optimizer and a warm-up strategy are used: the learning rate increases from 1 × 10^(-3) to 5 × 10^(-3) over the first 5 epochs and is then reduced to 2.5 × 10^(-4) with a logarithmic decay over the next 15 epochs. In the second stage, only the YouTube-VOS and COCO datasets are used, because both provide mask ground truth, which improves video object segmentation; the learning rate is reduced from 2.5 × 10^(-4) to 1.0 × 10^(-4) with a logarithmic decay over 20 epochs. The expected average overlap of the proposed method reaches 0.462 on VOT-2016 and 0.408 on VOT-2018, 0.029 and 0.028 higher than SiamMask, respectively, which is a state-of-the-art result with better robustness. Competitive results are also achieved on the DAVIS-2016 and DAVIS-2017 VOS datasets. On the multi-object DAVIS-2017 dataset, the proposed method outperforms SiamMask: the evaluation indexes J_(M) and F_(M) reach 56.0 and 59.0, respectively, and the region and contour decay values J_(D) and F_(D), at 17.9 and 19.8, are both lower than those of SiamMask. The method runs at 45 frames per second, which is real-time.

Conclusion This study proposes a multitask end-to-end framework for real-time VOT and VOS. The proposed method integrates multi-scale context information and video interframe information, fully captures multi-scale context, and exploits the information between video frames, which makes the network better at segmenting multi-scale target objects while remaining robust.
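As a companion sketch, the interframe mask propagation module mentioned in the Method part above could look roughly like the following. Only the 3 × 3 kernels and the atrous rates 2, 3, and 6 come from the text; fusing the previous-frame mask and features by concatenation, the 1 × 1 fusion convolution, and the channel widths are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MaskPropagationModule(nn.Module):
    """Hedged sketch: fuse the previous frame's features and mask with the
    current-frame features, then refine with 3x3 convolutions whose atrous
    rates are 2, 3, and 6 as stated in the paper. The fusion scheme and
    channel widths are illustrative assumptions, not the authors' design."""
    def __init__(self, feat_ch=256):
        super().__init__()
        # 1x1 fusion of current features, previous features, and previous mask
        self.fuse = nn.Conv2d(feat_ch * 2 + 1, feat_ch, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=6, dilation=6), nn.ReLU(inplace=True),
        )

    def forward(self, cur_feat, prev_feat, prev_mask):
        # prev_mask: (B, 1, H, W), resized to the feature resolution beforehand;
        # for single-image training samples, prev_feat and prev_mask are set
        # equal to the current frame, as described in the paper.
        x = torch.cat([cur_feat, prev_feat, prev_mask], dim=1)
        return self.refine(self.fuse(x))
```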
Authors
李瀚 (Li Han)
刘坤华 (Liu Kunhua)
刘嘉杰 (Liu Jiajie)
张晓晔 (Zhang Xiaoye)
Li Han; Liu Kunhua; Liu Jiajie; Zhang Xiaoye (School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China; Guangdong Diankeyuan Energy Technology Co., Ltd., Guangzhou 510080, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journal (北大核心)
2021, No. 1, pp. 101-112 (12 pages)
Journal of Image and Graphics
Funding
National Key Research and Development Program of China (2018YFB1305002)
National Natural Science Foundation of China (61773414, 62006256)
Key Research and Development Program of Guangzhou (202007050002).
Keywords
visual object tracking (VOT)
video object segmentation (VOS)
fully convolutional network (FCN)
atrous spatial pyramid pooling
inter-frame mask propagation