We introduce a novel method using a new generative model that automatically learns effective representations of the target and background appearance to detect, segment, and track each instance in a video sequence. Unlike current discriminative tracking-by-detection solutions, our hierarchical structural embedding learning predicts higher-quality masks with accurate boundary details over the spatio-temporal space via normalizing flows. We formulate the instance inference procedure as hierarchical spatio-temporal embedding learning across time and space. Given a video clip, our method first coarsely locates the pixels belonging to a particular instance with a Gaussian distribution, and then builds a novel mixing distribution that refines the instance boundary by fusing hierarchical appearance embedding information in a coarse-to-fine manner. For the mixing distribution, we use a factorized conditional normalizing flow to estimate the distribution parameters and improve segmentation performance. Comprehensive qualitative, quantitative, and ablation experiments on three representative video instance segmentation benchmarks (YouTube-VIS19, YouTube-VIS21, and OVIS) demonstrate the effectiveness of the proposed method. More impressively, the superior performance of our model on an unsupervised video object segmentation dataset (DAVIS19) shows its generalizability. Our implementation is publicly available at https://github.com/zyqin19/HEVis.
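To make the coarse localization step concrete, here is a minimal sketch (not the authors' implementation) that scores per-pixel embeddings against per-instance Gaussian centers; `emb`, `centers`, and `sigma` are hypothetical names, and the flow-based mixing distribution that refines boundaries is omitted entirely.

```python
import numpy as np

def coarse_instance_probs(emb, centers, sigma=1.0):
    """Coarse stage only: soft-assign each pixel embedding to the
    nearest per-instance Gaussian center (isotropic covariance).

    emb:     (H, W, D) pixel embeddings from some backbone
    centers: (K, D) one mean embedding per instance
    returns: (H, W, K) soft assignment of pixels to instances
    """
    H, W, D = emb.shape
    flat = emb.reshape(-1, D)                                  # (HW, D)
    d2 = ((flat[:, None, :] - centers[None]) ** 2).sum(-1)     # (HW, K)
    logits = -d2 / (2.0 * sigma ** 2)                          # Gaussian log-density up to a constant
    logits -= logits.max(axis=1, keepdims=True)                # stabilize the softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return p.reshape(H, W, -1)
```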
Recently, video object segmentation has received great attention in the computer vision community. Most existing methods rely heavily on pixel-wise human annotations, which are expensive and time-consuming to obtain. To tackle this problem, we make an early attempt to achieve video object segmentation with scribble-level supervision, which can spare large amounts of human labor in collecting manual annotations. However, conventional network architectures and learning objectives do not work well in this scenario because the supervision is highly sparse and incomplete. To address this issue, this paper introduces two novel elements for learning the video object segmentation model. The first is a scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background. The second is a scribble-supervised loss, which can optimize the unlabeled pixels and dynamically correct inaccurately segmented areas during training. To evaluate the proposed method, we run experiments on two video object segmentation benchmarks, YouTube-VOS and densely annotated video segmentation (DAVIS)-2017. We first generate scribble annotations from the original per-pixel annotations, then train our model and compare its test performance with baseline models and other existing works. Extensive experiments demonstrate that the proposed method works effectively and approaches the performance of methods that require dense per-pixel annotations.
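The sparse-supervision setting can be illustrated with a plain partial cross-entropy term: a minimal PyTorch sketch assuming scribble maps encode class ids with an ignore value on unlabeled pixels. The paper's full scribble-supervised loss also optimizes unlabeled pixels and corrects wrong regions during training, which this sketch does not attempt.

```python
import torch.nn.functional as F

def partial_cross_entropy(logits, scribble, ignore_index=255):
    """Minimal scribble-style supervision: cross-entropy computed only
    on pixels the scribble actually labels; all others are ignored.

    logits:   (B, C, H, W) raw network outputs
    scribble: (B, H, W) class ids, with ignore_index on unlabeled pixels
    """
    return F.cross_entropy(logits, scribble, ignore_index=ignore_index)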
While the development of particular video segmentation algorithms has attracted considerable research interest, relatively little effort has been devoted to providing a methodology for evaluating their performance. In this paper, we propose a methodology to objectively evaluate video segmentation algorithms against ground truth, based on computing the deviation of segmentation results from a reference segmentation. Four metrics, based on pixel classification, edges, relative foreground area, and relative position, are combined to assess spatial accuracy. Temporal coherency is evaluated using the difference in spatial accuracy between successive frames. The experimental results show the feasibility of our approach. Moreover, it is computationally more efficient than previous methods. It can be applied to provide an offline ranking among different segmentation algorithms and to optimally set the parameters of a given algorithm.
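A hedged sketch of two pieces of such a methodology, assuming binary per-frame masks: one simple spatial measure (the abstract combines four), plus the temporal term built from its frame-to-frame variation.

```python
import numpy as np

def spatial_accuracy(pred, gt):
    """One of several possible spatial measures: the fraction of pixels
    whose foreground/background label matches the reference mask."""
    return float((pred == gt).mean())

def temporal_coherency(preds, gts):
    """Temporal coherency read as the frame-to-frame variation of
    spatial accuracy (smaller means steadier segmentation).
    preds/gts: lists of (H, W) binary masks for successive frames."""
    acc = [spatial_accuracy(p, g) for p, g in zip(preds, gts)]
    return [abs(a - b) for a, b in zip(acc[1:], acc[:-1])]
```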
In order to detect objects in video efficiently, an automatic, real-time video segmentation algorithm based on a background model and color clustering is proposed. The algorithm consists of four phases: background restoration, moving-object extraction, moving-object region clustering, and post-processing. The threshold for background restoration is not given in advance; it is obtained automatically. A new object-region clustering algorithm based on the background model and color clustering is proposed to remove significant noise, and an efficient method for eliminating shadows is also used. The approach was compared with other methods on pixel error ratio. The experimental results indicate that the algorithm is correct and efficient.
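The extraction phase can be sketched as background subtraction with an automatically chosen threshold. The abstract does not spell out the automatic rule, so the mean-plus-two-standard-deviations choice below is only a placeholder for it.

```python
import numpy as np

def moving_object_mask(frame, background, thresh=None):
    """Background-subtraction sketch: pixels far from the restored
    background model become moving-object candidates.
    frame, background: (H, W) grayscale arrays."""
    diff = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    if thresh is None:
        # Stand-in for the paper's automatic threshold, which the
        # abstract does not specify.
        thresh = diff.mean() + 2.0 * diff.std()
    return diff > thresh
```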
A wide range of camera apps and online video conferencing services support changing the background in real time for aesthetic, privacy, and security reasons. Numerous studies show that Deep Learning (DL) is a suitable option for human segmentation, and that an ensemble of multiple DL-based segmentation models can improve the result. However, these approaches are not as effective when applied directly to image segmentation in a video. This paper proposes an Adaptive N-Frames Ensemble (AFE) approach for high-movement human segmentation in a video using an ensemble of multiple DL models. In contrast to an ensemble that executes multiple DL models simultaneously on every video frame, the proposed AFE approach executes only a single DL model on the current frame. It combines the segmentation outputs of previous frames into the final output when the frame difference is below a particular threshold. Our method builds on the N-Frames Ensemble (NFE) method, which ensembles the segmentations of the current and previous video frames; however, NFE is not suitable for segmenting fast-moving objects or videos with low frame rates. The proposed AFE approach addresses these limitations. Our experiments use three human segmentation models, namely Fully Convolutional Network (FCN), DeepLabv3, and Mediapipe. We evaluated our approach on 1711 single-person videos of the TikTok50f dataset, a reconstructed version of the publicly available TikTok dataset obtained by cropping, resizing, and dividing it into videos of 50 frames each. This paper compares the proposed AFE with single models, a two-model ensemble, and the NFE models. The experimental results show that the proposed AFE is suitable for both low-movement and high-movement human segmentation in a video.
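The adaptive decision can be sketched per frame as below. Only the run-one-model-and-reuse-cached-outputs-when-frame-difference-is-small logic comes from the abstract; the model cycling, the mean combination, and all names are our assumptions.

```python
import numpy as np

def afe_step(frame, prev_frame, cached_outputs, models, t, thresh=0.05):
    """Sketch of the AFE idea: run only one DL model on the current
    frame, and combine cached outputs of previous frames only when the
    frame difference is small.

    frame, prev_frame: (H, W) float arrays (prev_frame may be None)
    cached_outputs:    list of earlier (H, W) foreground-probability maps
    models:            pool of callables, each mapping frame -> prob map
    """
    out = models[t % len(models)](frame)       # one model per frame (cycling is our choice)
    cached_outputs.append(out)
    if prev_frame is not None:
        change = np.abs(frame - prev_frame).mean()
        if change < thresh and len(cached_outputs) >= len(models):
            # Low movement: ensemble the last N single-model outputs.
            return np.mean(cached_outputs[-len(models):], axis=0) > 0.5
    # High movement (or warm-up): trust the current frame's model alone.
    return out > 0.5
```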
With the development of the modern information society, more and more multimedia information is available, so multimedia processing is becoming an important task for scientists in the relevant areas. Among multimedia data, visual information is the most attractive due to its direct, vivid character, but at the same time the huge amount of video data poses many challenges for video storage, processing, and transmission.
Segmentation of semantic Video Object Planes (VOPs) from a video sequence is key to the MPEG-4 standard with its content-based video coding. In this paper, an approach for automatic Segmentation of VOPs Based on Spatio-Temporal Information (SBSTI) is proposed. The experimental results demonstrate the good performance of the algorithm.
This paper presents a video moving-object segmentation method based on area selection. The method applies a simple and practical spatial-first region segmentation, then selects among the candidate regions using motion information and a space-time energy model; finally, an accurate object segmentation is obtained through post-processing. Experiments show that the algorithm is robust.
In the present technological world, surveillance cameras generate an immense amount of video data from various sources, making its scrutiny tough for computer vision specialists. It is difficult to search for anomalous events manually in these massive video records, since they happen infrequently and with low probability in real-world monitoring systems. Intelligent surveillance is therefore a requirement of the modern day, as it enables automatic identification of normal and aberrant behavior using artificial intelligence and computer vision technologies. In this article, we introduce an efficient attention-based deep-learning approach for anomaly detection in surveillance video (ADSV). At the input of the ADSV, a shot-boundary detection technique is used to segment prominent frames. Next, a Lightweight Convolutional Neural Network (LWCNN) model receives the segmented frames and extracts spatial and temporal information from its intermediate layers. Spatial and temporal features are then learned by Long Short-Term Memory (LSTM) cells and an attention network from a series of frames for each anomalous activity in a sample. To detect motion and action, the LWCNN receives chronologically sorted frames. Finally, anomalous activity in the video is identified using the trained ADSV model. Extensive experiments are conducted on complex and challenging benchmark datasets, and the results, compared with state-of-the-art methodologies, show a significant improvement, demonstrating the efficiency of our ADSV method.
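The features-to-LSTM-to-attention part of the pipeline can be sketched as follows, with invented sizes and the LWCNN backbone replaced by externally supplied per-frame feature vectors; this is a loose reading of the abstract, not the ADSV architecture.

```python
import torch
import torch.nn as nn

class AnomalyHead(nn.Module):
    """Sketch of a per-frame-features -> LSTM -> temporal-attention ->
    anomaly-score head; feat_dim and hidden are illustrative."""
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # scalar attention score per time step
        self.cls = nn.Linear(hidden, 1)    # normal vs. anomalous

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)                  # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        ctx = (w * h).sum(dim=1)                 # attention-pooled clip feature
        return torch.sigmoid(self.cls(ctx))      # anomaly probability in [0, 1]
```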
Current mainstream unsupervised video object segmentation (UVOS) approaches typically incorporate optical flow as motion information to locate the primary objects in coherent video frames. However, they fuse appearance and motion information without evaluating the quality of the optical flow. When poor-quality optical flow is used in the interaction with the appearance information, it introduces significant noise and degrades overall performance. To alleviate this issue, we first employ a quality evaluation module (QEM) to evaluate the optical flow. We then select high-quality optical flow as the motion cue to fuse with the appearance information, which prevents poor-quality optical flow from diverting the network's attention. Moreover, we design an appearance-guided fusion module (AGFM) to better integrate appearance and motion information. Extensive experiments on several widely used datasets, including DAVIS-16, FBMS-59, and YouTube-Objects, demonstrate that the proposed method outperforms existing methods.
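The gating idea reduces to a simple rule; the sketch below compresses QEM to a scalar quality score and AGFM to additive fusion, which is a strong simplification of the actual modules.

```python
import numpy as np

def gated_fusion(app_feat, flow_feat, flow_quality, tau=0.5):
    """Fuse motion features only when the estimated optical-flow quality
    clears a threshold, so poor flow cannot drag attention away from
    appearance. tau and the additive fusion are illustrative choices."""
    if flow_quality >= tau:
        return app_feat + flow_feat   # motion cue deemed trustworthy
    return app_feat                   # fall back to appearance only
```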
Video object segmentation is important for video surveillance, object tracking, video object recognition, and video editing. An adaptive video segmentation algorithm based on hidden conditional random fields (HCRFs) is proposed, which models the spatio-temporal constraints of a video sequence. To improve segmentation quality, the weights of the spatio-temporal constraints are adaptively updated by online learning of the HCRFs. Shadows are another factor affecting segmentation quality; to separate foreground objects from the shadows they cast, a linear transform of the Gaussian distribution of the background is adopted to model the shadow. The experimental results demonstrate that the error ratio of our algorithm is reduced by 23% and 19%, respectively, compared with the Gaussian mixture model (GMM) and spatio-temporal Markov random fields (MRFs).
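The shadow model rests on the observation that a cast shadow darkens a surface roughly multiplicatively, i.e., a shadow pixel looks like a linearly scaled-down background Gaussian. A minimal grayscale sketch, with illustrative attenuation bounds rather than the paper's fitted values:

```python
import numpy as np

def shadow_mask(frame, bg_mean, bg_std, a_lo=0.4, a_hi=0.9):
    """Label as shadow those foreground pixels whose intensity is a
    proportionally darkened version of the background mean.
    frame, bg_mean, bg_std: (H, W) float arrays."""
    foreground = np.abs(frame - bg_mean) > 2.5 * bg_std     # fails the background test
    ratio = frame / np.maximum(bg_mean, 1e-6)
    return foreground & (ratio > a_lo) & (ratio < a_hi)     # darker, but proportionally so
```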
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, using a better mutual-restraint scheme between motion and appearance that allows cross-modal features to be exploited in the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features in the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion) and compares well to leading methods for both video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
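Bidirectional cross-modal message passing in the spirit of RCAM can be sketched with stock attention layers; the paper's relational attention is more specialized than the plain multi-head attention used here.

```python
import torch.nn as nn

class BidirCrossAttention(nn.Module):
    """Loose sketch: appearance tokens attend to motion tokens and vice
    versa, with residual fusion. dim and heads are illustrative."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.m2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # motion -> appearance
        self.a2m = nn.MultiheadAttention(dim, heads, batch_first=True)  # appearance -> motion

    def forward(self, app, mot):               # both: (B, N, dim) token sequences
        app_msg, _ = self.m2a(app, mot, mot)   # appearance enriched by motion
        mot_msg, _ = self.a2m(mot, app, app)   # motion enriched by appearance
        return app + app_msg, mot + mot_msg    # residual fusion of the messages
```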
We present a lightweight and efficient semi-supervised video object segmentation network based on the space-time memory framework. To some extent, our method addresses the two difficulties of traditional video object segmentation: single-frame computation takes too long, and segmenting the current frame should use more information from past frames. The algorithm uses a global context (GC) module to achieve high-performance, real-time segmentation. The GC module effectively integrates multi-frame image information without increasing memory use and can process each frame in real time. Moreover, since the prediction mask of the previous frame helps segment the current frame, we feed it into a spatial constraint module (SCM), which constrains the areas of segments in the current frame. The SCM effectively alleviates mismatching of similar targets while consuming few additional resources. We also added a refinement module to the decoder to improve boundary segmentation. Our model achieves state-of-the-art results on various datasets, scoring 80.1% on YouTube-VOS 2018 and a J&F score of 78.0% on DAVIS 2017, while taking 0.05 s per frame on the DAVIS 2016 validation dataset.
We present the first comprehensive video polyp segmentation (VPS) study in the deep learning era. Over the years, progress in VPS has been held back by the lack of a large-scale dataset with fine-grained segmentation annotations. To address this issue, we first introduce a high-quality, frame-by-frame annotated VPS dataset, named SUN-SEG, which contains 158,690 colonoscopy video frames from the well-known SUN database. We provide additional annotations of diverse types, i.e., attribute, object mask, boundary, scribble, and polygon. Second, we design a simple but efficient baseline, named PNS+, which consists of a global encoder, a local encoder, and normalized self-attention (NS) blocks. The global and local encoders receive an anchor frame and multiple successive frames to extract long-term and short-term spatial-temporal representations, which are then progressively refined by two NS blocks. Extensive experiments show that PNS+ achieves the best performance and real-time inference speed (170 fps), making it a promising solution for the VPS task. Third, we extensively evaluate 13 representative polyp/object segmentation models on our SUN-SEG dataset and provide attribute-based comparisons. Finally, we discuss several open issues and suggest possible research directions for the VPS community. Our project and dataset are publicly available at https://github.com/GewelsJI/VPS.
We propose an automatic video segmentation method based on an optimized SaliencyCut equipped with information centroid (IC) detection, following the level-balance principle from physics. Unlike existing methods, the IC provides image information of another dimension to enhance video segmentation accuracy. Specifically, our IC is built on the information-level balance principle in the image and serves as an information pivot that aggregates all the image information to a point. To effectively enhance the saliency of the target object and suppress the background, we combine the color and coordinate information of the image when calculating the local and global ICs. Saliency maps for all frames in the video are then calculated based on the detected ICs. By applying IC smoothing to enhance the optimized saliency detection, we can further correct unsatisfactory saliency maps where sharp variations of color or motion occur in complex videos. Finally, we obtain the segmentation results from the IC-based saliency maps and the optimized SaliencyCut. Our method is evaluated on the DAVIS dataset, which consists of various kinds of challenging videos, and compared with state-of-the-art methods. Convincing visual results and statistical comparisons demonstrate its advantages and robustness for automatic video segmentation.
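Under a balance principle, an information centroid is the point where the weighted "mass" of image information balances, i.e., a weight-averaged pixel coordinate. A minimal sketch, assuming some saliency-like map stands in for the paper's combined color-and-coordinate weights:

```python
import numpy as np

def information_centroid(weight):
    """Weighted centroid of an information map: the coordinate where
    the information 'mass' balances.
    weight: (H, W) non-negative array; returns (row, col) floats."""
    H, W = weight.shape
    ys, xs = np.mgrid[0:H, 0:W]
    total = weight.sum() + 1e-12
    return (ys * weight).sum() / total, (xs * weight).sum() / total
```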
Extracting moving targets accurately from video is of great significance in intelligent transport, and is closely related to video segmentation and matting. In this paper, we propose a non-interactive, automatic segmentation method for extracting moving targets. First, motion knowledge in the video is detected with orthogonal Gaussian-Hermite moments and the Otsu algorithm, and this knowledge is treated as foreground seeds. Second, background seeds are generated by a distance transformation based on the foreground seeds. Third, the foreground and background seeds are treated as extra constraints, and a mask is generated using graph-cut methods or closed-form solutions. Comparison showed that the closed-form solution based on soft segmentation performs better, and that the extra constraint has a larger impact on the result than the other parameters. Experiments demonstrated that the proposed method can effectively extract moving targets from video in real time.
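The seeding scheme can be sketched as Otsu thresholding of a motion-energy map followed by a distance transform; the Gaussian-Hermite moment computation that would produce `motion_map` is omitted, and `bg_margin` is an illustrative distance in pixels.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def seeds_from_motion(motion_map, bg_margin=15):
    """Foreground seeds via Otsu's threshold on a motion map; background
    seeds as pixels far from any foreground seed."""
    # Otsu: maximize between-class variance over a 256-bin histogram.
    m = (255 * (motion_map - motion_map.min()) /
         (np.ptp(motion_map) + 1e-12)).astype(np.uint8)
    p = np.bincount(m.ravel(), minlength=256).astype(np.float64)
    p /= p.sum()
    w = np.cumsum(p)                         # class-0 probability
    mu = np.cumsum(p * np.arange(256))       # class-0 cumulative mean
    between = (mu[-1] * w - mu) ** 2 / (w * (1 - w) + 1e-12)
    fg = m > int(np.argmax(between))         # foreground seeds
    bg = distance_transform_edt(~fg) > bg_margin   # far from every fg seed
    return fg, bg
```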
A shot presents a contiguous action recorded by an uninterrupted camera operation, and frames within a shot keep spatio-temporal coherence. Segmenting a serial video stream into meaningful shots is the first pass in video analysis and content-based video understanding. In this paper, a novel scheme based on improved two-dimensional entropy is proposed to partition a video into shots. First, shot transition candidates are detected using a two-pass algorithm: a coarse searching pass and a fine searching pass. Second, using the two-dimensional entropy character of the image, correctly detected transition candidates are classified into different transition types, while falsely detected shot breaks are identified and removed. Finally, the boundary of a gradual transition is precisely located by incorporating the two-dimensional entropy characters of the image into the gradual transition. A large number of video sequences were used to test our system, and promising results were obtained.
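Two-dimensional image entropy is typically the Shannon entropy of the joint histogram of (pixel intensity, local mean intensity), which captures spatial structure that a one-dimensional histogram misses. A sketch with an illustrative 3x3 local mean and 64 bins (the paper's improved variant may differ):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def two_dimensional_entropy(gray, bins=64):
    """2-D entropy of a grayscale frame from the joint histogram of
    pixel intensity and 3x3 local mean intensity.
    gray: (H, W) uint8 image; returns entropy in bits."""
    local_mean = uniform_filter(gray.astype(np.float64), size=3)
    a = (gray.astype(np.float64) * bins / 256).astype(int).clip(0, bins - 1)
    b = (local_mean * bins / 256).astype(int).clip(0, bins - 1)
    joint = np.bincount((a * bins + b).ravel(), minlength=bins * bins)
    p = joint / joint.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

Frames where this value jumps sharply relative to their neighbors would become transition candidates in the coarse searching pass.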
Video structure analysis is a basic requirement for most content-based video editing and processing systems. This paper presents a fast video structure analysis method based on image segmentation in each frame, with region matching between frames. The structure analysis decomposes the video into several moving objects, including information about their colors, positions, shapes, movements, and lifetimes. The method also supports user interaction to improve the results. The results show that this method is fast and stable and can complete video analysis interactively.