Recently,video object segmentation has received great attention in the computer vision community.Most of the existing methods heavily rely on the pixel-wise human annotations,which are expensive and time-consuming to ...Recently,video object segmentation has received great attention in the computer vision community.Most of the existing methods heavily rely on the pixel-wise human annotations,which are expensive and time-consuming to obtain.To tackle this problem,we make an early attempt to achieve video object segmentation with scribble-level supervision,which can alleviate large amounts of human labor for collecting the manual annotation.However,using conventional network architectures and learning objective functions under this scenario cannot work well as the supervision information is highly sparse and incomplete.To address this issue,this paper introduces two novel elements to learn the video object segmentation model.The first one is the scribble attention module,which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background.The other one is the scribble-supervised loss,which can optimize the unlabeled pixels and dynamically correct inaccurate segmented areas during the training stage.To evaluate the proposed method,we implement experiments on two video object segmentation benchmark datasets,You Tube-video object segmentation(VOS),and densely annotated video segmentation(DAVIS)-2017.We first generate the scribble annotations from the original per-pixel annotations.Then,we train our model and compare its test performance with the baseline models and other existing works.Extensive experiments demonstrate that the proposed method can work effectively and approach to the methods requiring the dense per-pixel annotations.展开更多
While the development of particular video segmentation algorithms has attracted considerable research interest, relatively little effort has been devoted to provide a methodology for evaluating their performance. In t...While the development of particular video segmentation algorithms has attracted considerable research interest, relatively little effort has been devoted to provide a methodology for evaluating their performance. In this paper, we propose a methodology to objectively evaluate video segmentation algorithm with ground-truth, which is based on computing the deviation of segmentation results from the reference segmentation. Four different metrics based on classification pixels, edges, relative foreground area and relative position respectively are combined to address the spatial accuracy. Temporal coherency is evaluated by utilizing the difference of spatial accuracy between successive frames. The experimental results show the feasibility of our approach. Moreover, it is computationally more efficient than previous methods. It can be applied to provide an offline ranking among different segmentation algorithms and to optimally set the parameters for a given algorithm.展开更多
Segmentation of semantic Video Object Planes (VOP's) from video sequence is a key to the standard MPEG-4 with content-based video coding. In this paper, the approach of automatic Segmentation of VOP's Based on...Segmentation of semantic Video Object Planes (VOP's) from video sequence is a key to the standard MPEG-4 with content-based video coding. In this paper, the approach of automatic Segmentation of VOP's Based on Spatio-Temporal Information (SBSTI) is proposed.The proceeding results demonstrate the good performance of the algorithm.展开更多
Current mainstream unsupervised video object segmentation(UVOS) approaches typically incorporate optical flow as motion information to locate the primary objects in coherent video frames. However, they fuse appearance...Current mainstream unsupervised video object segmentation(UVOS) approaches typically incorporate optical flow as motion information to locate the primary objects in coherent video frames. However, they fuse appearance and motion information without evaluating the quality of the optical flow. When poor-quality optical flow is used for the interaction with the appearance information, it introduces significant noise and leads to a decline in overall performance. To alleviate this issue, we first employ a quality evaluation module(QEM) to evaluate the optical flow. Then, we select high-quality optical flow as motion cues to fuse with the appearance information, which can prevent poor-quality optical flow from diverting the network's attention. Moreover, we design an appearance-guided fusion module(AGFM) to better integrate appearance and motion information. Extensive experiments on several widely utilized datasets, including DAVIS-16, FBMS-59, and You Tube-Objects, demonstrate that the proposed method outperforms existing methods.展开更多
In order to detect the object in video efficiently, an automatic and real time video segmentation algorithm based on background model and color clustering is proposed. This algorithm consists of four phases: backgroun...In order to detect the object in video efficiently, an automatic and real time video segmentation algorithm based on background model and color clustering is proposed. This algorithm consists of four phases: background restoration, moving objects extract, moving objects region clustering and post processing. The threshold of the background restoration is not given in advanced. It can be gotten automatically. And a new object region cluster algorithm based on background model and color clustering to remove significance noise is proposed. An efficient method of eliminating shadow is also used. This approach was compared with other methods on pixel error ratio. The experiment result indicates the algorithm is correct and efficient.展开更多
Previous video object segmentation approachesmainly focus on simplex solutions linking appearance and motion,limiting effective feature collaboration between these two cues.In this work,we study a novel and efficient ...Previous video object segmentation approachesmainly focus on simplex solutions linking appearance and motion,limiting effective feature collaboration between these two cues.In this work,we study a novel and efficient full-duplex strategy network(FSNet)to address this issue,by considering a better mutual restraint scheme linking motion and appearance allowing exploitation of cross-modal features from the fusion and decoding stage.Specifically,we introduce a relational cross-attention module(RCAM)to achieve bidirectional message propagation across embedding sub-spaces.To improve the model’s robustness and update inconsistent features from the spatiotemporal embeddings,we adopt a bidirectional purification module after the RCAM.Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios(e.g.,motion blur and occlusion),and compares well to leading methods both for video object segmentation and video salient object detection.The project is publicly available at https://github.com/GewelsJI/FSNet.展开更多
We present a lightweight and efficient semisupervised video object segmentation network based on the space-time memory framework.To some extent,our method solves the two difficulties encountered in traditional video o...We present a lightweight and efficient semisupervised video object segmentation network based on the space-time memory framework.To some extent,our method solves the two difficulties encountered in traditional video object segmentation:one is that the single frame calculation time is too long,and the other is that the current frame’s segmentation should use more information from past frames.The algorithm uses a global context(GC)module to achieve highperformance,real-time segmentation.The GC module can effectively integrate multi-frame image information without increased memory and can process each frame in real time.Moreover,the prediction mask of the previous frame is helpful for the segmentation of the current frame,so we input it into a spatial constraint module(SCM),which constrains the areas of segments in the current frame.The SCM effectively alleviates mismatching of similar targets yet consumes few additional resources.We added a refinement module to the decoder to improve boundary segmentation.Our model achieves state-of-the-art results on various datasets,scoring 80.1%on YouTube-VOS 2018 and a J&F score of 78.0%on DAVIS 2017,while taking 0.05 s per frame on the DAVIS 2016 validation dataset.展开更多
This paper proposes a motion-based region growing segmentation scheme for the object-based video coding, which segments an image into homogeneous regions characterized by a coherent motion. It adopts a block matching ...This paper proposes a motion-based region growing segmentation scheme for the object-based video coding, which segments an image into homogeneous regions characterized by a coherent motion. It adopts a block matching algorithm to estimate motion vectors and uses morphological tools such as open-close by reconstruction and the region-growing version of the watershed algorithm for spatial segmentation to improve the temporal segmentation. In order to determine the reliable motion vectors, this paper also proposes a change detection algorithm and a multi-candidate pro- screening motion estimation method. Preliminary simulation results demonstrate that the proposed scheme is feasible. The main advantage of the scheme is its low computational load.展开更多
大量基于深度学习的无监督视频目标分割(Unsupervised video object segmentation,UVOS)算法存在模型参数量与计算量较大的问题,这显著限制了算法在实际中的应用.提出了基于运动引导的视频目标分割网络,在大幅降低模型参数量与计算量的...大量基于深度学习的无监督视频目标分割(Unsupervised video object segmentation,UVOS)算法存在模型参数量与计算量较大的问题,这显著限制了算法在实际中的应用.提出了基于运动引导的视频目标分割网络,在大幅降低模型参数量与计算量的同时,提升视频目标分割性能.整个模型由双流网络、运动引导模块、多尺度渐进融合模块三部分组成.具体地,首先,RGB图像与光流估计输入双流网络提取物体外观特征与运动特征;然后,运动引导模块通过局部注意力提取运动特征中的语义信息,用于引导外观特征学习丰富的语义信息;最后,多尺度渐进融合模块获取双流网络的各个阶段输出的特征,将深层特征渐进地融入浅层特征,最终提升边缘分割效果.在3个标准数据集上进行了大量评测,实验结果表明了该方法的优越性能.展开更多
Moving object segmentation(MOS),aiming at segmenting moving objects from video frames,is an important and challenging task in computer vision and with various applications.With the development of deep learning(DL),MOS...Moving object segmentation(MOS),aiming at segmenting moving objects from video frames,is an important and challenging task in computer vision and with various applications.With the development of deep learning(DL),MOS has also entered the era of deep models toward spatiotemporal feature learning.This paper aims to provide the latest review of recent DL-based MOS methods proposed during the past three years.Specifically,we present a more up-to-date categorization based on model characteristics,then compare and discuss each category from feature learning(FL),and model training and evaluation perspectives.For FL,the methods reviewed are divided into three types:spatial FL,temporal FL,and spatiotemporal FL,then analyzed from input and model architectures aspects,three input types,and four typical preprocessing subnetworks are summarized.In terms of training,we discuss ideas for enhancing model transferability.In terms of evaluation,based on a previous categorization of scene dependent evaluation and scene independent evaluation,and combined with whether used videos are recorded with static or moving cameras,we further provide four subdivided evaluation setups and analyze that of reviewed methods.We also show performance comparisons of some reviewed MOS methods and analyze the advantages and disadvantages of reviewed MOS methods in terms of technology.Finally,based on the above comparisons and discussions,we present research prospects and future directions.展开更多
基金supported in part by the National Key R&D Program of China(2017YFB0502904)the National Science Foundation of China(61876140)。
文摘Recently,video object segmentation has received great attention in the computer vision community.Most of the existing methods heavily rely on the pixel-wise human annotations,which are expensive and time-consuming to obtain.To tackle this problem,we make an early attempt to achieve video object segmentation with scribble-level supervision,which can alleviate large amounts of human labor for collecting the manual annotation.However,using conventional network architectures and learning objective functions under this scenario cannot work well as the supervision information is highly sparse and incomplete.To address this issue,this paper introduces two novel elements to learn the video object segmentation model.The first one is the scribble attention module,which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background.The other one is the scribble-supervised loss,which can optimize the unlabeled pixels and dynamically correct inaccurate segmented areas during the training stage.To evaluate the proposed method,we implement experiments on two video object segmentation benchmark datasets,You Tube-video object segmentation(VOS),and densely annotated video segmentation(DAVIS)-2017.We first generate the scribble annotations from the original per-pixel annotations.Then,we train our model and compare its test performance with the baseline models and other existing works.Extensive experiments demonstrate that the proposed method can work effectively and approach to the methods requiring the dense per-pixel annotations.
文摘While the development of particular video segmentation algorithms has attracted considerable research interest, relatively little effort has been devoted to provide a methodology for evaluating their performance. In this paper, we propose a methodology to objectively evaluate video segmentation algorithm with ground-truth, which is based on computing the deviation of segmentation results from the reference segmentation. Four different metrics based on classification pixels, edges, relative foreground area and relative position respectively are combined to address the spatial accuracy. Temporal coherency is evaluated by utilizing the difference of spatial accuracy between successive frames. The experimental results show the feasibility of our approach. Moreover, it is computationally more efficient than previous methods. It can be applied to provide an offline ranking among different segmentation algorithms and to optimally set the parameters for a given algorithm.
文摘Segmentation of semantic Video Object Planes (VOP's) from video sequence is a key to the standard MPEG-4 with content-based video coding. In this paper, the approach of automatic Segmentation of VOP's Based on Spatio-Temporal Information (SBSTI) is proposed.The proceeding results demonstrate the good performance of the algorithm.
基金supported by the National Natural Science Foundation of China (No.61872189)。
文摘Current mainstream unsupervised video object segmentation(UVOS) approaches typically incorporate optical flow as motion information to locate the primary objects in coherent video frames. However, they fuse appearance and motion information without evaluating the quality of the optical flow. When poor-quality optical flow is used for the interaction with the appearance information, it introduces significant noise and leads to a decline in overall performance. To alleviate this issue, we first employ a quality evaluation module(QEM) to evaluate the optical flow. Then, we select high-quality optical flow as motion cues to fuse with the appearance information, which can prevent poor-quality optical flow from diverting the network's attention. Moreover, we design an appearance-guided fusion module(AGFM) to better integrate appearance and motion information. Extensive experiments on several widely utilized datasets, including DAVIS-16, FBMS-59, and You Tube-Objects, demonstrate that the proposed method outperforms existing methods.
基金the Ministerial Level Advanced Research Foundation(10405033)
文摘In order to detect the object in video efficiently, an automatic and real time video segmentation algorithm based on background model and color clustering is proposed. This algorithm consists of four phases: background restoration, moving objects extract, moving objects region clustering and post processing. The threshold of the background restoration is not given in advanced. It can be gotten automatically. And a new object region cluster algorithm based on background model and color clustering to remove significance noise is proposed. An efficient method of eliminating shadow is also used. This approach was compared with other methods on pixel error ratio. The experiment result indicates the algorithm is correct and efficient.
基金Supported by the National Natural Science Foundation of China (No. 60772134, 60902081, 60902052) the 111 Project (No.B08038) the Fundamental Research Funds for the Central Universities(No.72105457).
基金This work was supported by the National Natural Science Foundation of China(62176169,61703077,and 62102207).
文摘Previous video object segmentation approachesmainly focus on simplex solutions linking appearance and motion,limiting effective feature collaboration between these two cues.In this work,we study a novel and efficient full-duplex strategy network(FSNet)to address this issue,by considering a better mutual restraint scheme linking motion and appearance allowing exploitation of cross-modal features from the fusion and decoding stage.Specifically,we introduce a relational cross-attention module(RCAM)to achieve bidirectional message propagation across embedding sub-spaces.To improve the model’s robustness and update inconsistent features from the spatiotemporal embeddings,we adopt a bidirectional purification module after the RCAM.Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios(e.g.,motion blur and occlusion),and compares well to leading methods both for video object segmentation and video salient object detection.The project is publicly available at https://github.com/GewelsJI/FSNet.
基金partially supported by the National Natural Science Foundation of China(Grant Nos.61802197,62072449,and 61632003)the Science and Technology Development Fund,Macao SAR(Grant Nos.0018/2019/AKP and SKL-IOTSC(UM)-2021-2023)+1 种基金the Guangdong Science and Technology Department(Grant No.2020B1515130001)University of Macao(Grant Nos.MYRG2020-00253-FST and MYRG2022-00059-FST).
文摘We present a lightweight and efficient semisupervised video object segmentation network based on the space-time memory framework.To some extent,our method solves the two difficulties encountered in traditional video object segmentation:one is that the single frame calculation time is too long,and the other is that the current frame’s segmentation should use more information from past frames.The algorithm uses a global context(GC)module to achieve highperformance,real-time segmentation.The GC module can effectively integrate multi-frame image information without increased memory and can process each frame in real time.Moreover,the prediction mask of the previous frame is helpful for the segmentation of the current frame,so we input it into a spatial constraint module(SCM),which constrains the areas of segments in the current frame.The SCM effectively alleviates mismatching of similar targets yet consumes few additional resources.We added a refinement module to the decoder to improve boundary segmentation.Our model achieves state-of-the-art results on various datasets,scoring 80.1%on YouTube-VOS 2018 and a J&F score of 78.0%on DAVIS 2017,while taking 0.05 s per frame on the DAVIS 2016 validation dataset.
文摘This paper proposes a motion-based region growing segmentation scheme for the object-based video coding, which segments an image into homogeneous regions characterized by a coherent motion. It adopts a block matching algorithm to estimate motion vectors and uses morphological tools such as open-close by reconstruction and the region-growing version of the watershed algorithm for spatial segmentation to improve the temporal segmentation. In order to determine the reliable motion vectors, this paper also proposes a change detection algorithm and a multi-candidate pro- screening motion estimation method. Preliminary simulation results demonstrate that the proposed scheme is feasible. The main advantage of the scheme is its low computational load.
文摘大量基于深度学习的无监督视频目标分割(Unsupervised video object segmentation,UVOS)算法存在模型参数量与计算量较大的问题,这显著限制了算法在实际中的应用.提出了基于运动引导的视频目标分割网络,在大幅降低模型参数量与计算量的同时,提升视频目标分割性能.整个模型由双流网络、运动引导模块、多尺度渐进融合模块三部分组成.具体地,首先,RGB图像与光流估计输入双流网络提取物体外观特征与运动特征;然后,运动引导模块通过局部注意力提取运动特征中的语义信息,用于引导外观特征学习丰富的语义信息;最后,多尺度渐进融合模块获取双流网络的各个阶段输出的特征,将深层特征渐进地融入浅层特征,最终提升边缘分割效果.在3个标准数据集上进行了大量评测,实验结果表明了该方法的优越性能.
基金National Natural Science Foundation of China(Nos.61702323 and 62172268)the Shanghai Municipal Natural Science Foundation,China(No.20ZR1423100)+2 种基金the Open Fund of Science and Technology on Thermal Energy and Power Laboratory(No.TPL2020C02)Wuhan 2nd Ship Design and Research Institute,Wuhan,China,the National Key Research and Development Program of China(No.2018YFB1306303)the Major Basic Research Projects of Natural Science Foundation of Shandong Province,China(No.ZR2019ZD07).
文摘Moving object segmentation(MOS),aiming at segmenting moving objects from video frames,is an important and challenging task in computer vision and with various applications.With the development of deep learning(DL),MOS has also entered the era of deep models toward spatiotemporal feature learning.This paper aims to provide the latest review of recent DL-based MOS methods proposed during the past three years.Specifically,we present a more up-to-date categorization based on model characteristics,then compare and discuss each category from feature learning(FL),and model training and evaluation perspectives.For FL,the methods reviewed are divided into three types:spatial FL,temporal FL,and spatiotemporal FL,then analyzed from input and model architectures aspects,three input types,and four typical preprocessing subnetworks are summarized.In terms of training,we discuss ideas for enhancing model transferability.In terms of evaluation,based on a previous categorization of scene dependent evaluation and scene independent evaluation,and combined with whether used videos are recorded with static or moving cameras,we further provide four subdivided evaluation setups and analyze that of reviewed methods.We also show performance comparisons of some reviewed MOS methods and analyze the advantages and disadvantages of reviewed MOS methods in terms of technology.Finally,based on the above comparisons and discussions,we present research prospects and future directions.