The motivation for this study is that the quality of deep fakes is constantly improving, which creates a need for new detection methods. The proposed Customized Convolutional Neural Network (CNN) method extracts structured data from video frames using facial landmark detection and uses it as input to the CNN. The customized CNN is a data-augmentation-based CNN model used to generate 'fake data' or 'fake images'. This study was carried out using Python and its libraries. We used 242 videos from the dataset gathered by the Deep Fake Detection Challenge, of which 199 were fake and the remaining 53 were real. Ten seconds were allotted for each video. There were 318 videos used in all, 199 of which were fake and 119 of which were real. Our proposed method achieved a testing accuracy of 91.47%, a loss of 0.342, and an AUC score of 0.92, outperforming two alternative approaches, CNN and MLP-CNN. Furthermore, our method achieved greater accuracy than contemporary models such as XceptionNet, Meso-4, EfficientNet-B0, MesoInception-4, VGG-16, and DST-Net. The novelty of this investigation is the development of a new CNN learning model that can accurately detect deep fake face photos.
Video salient object detection (VSOD) aims to locate the most attractive objects in a video by exploring spatial and temporal features. VSOD is a challenging task in computer vision, as it involves processing complex spatial data that is also influenced by temporal dynamics. Despite the progress made by existing VSOD models, they still struggle in scenes with great background diversity within and between frames. Additionally, they encounter difficulties related to accumulated noise and high time consumption when extracting temporal features over a long duration. We propose a multi-stream temporal enhanced network (MSTENet) to address these problems. It investigates the collaboration of saliency cues in the spatial domain with a multi-stream structure to deal with the background diversity challenge. A straightforward yet efficient approach for temporal feature extraction is developed to avoid accumulated noise and reduce time consumption. MSTENet differs from other VSOD methods in its incorporation of both foreground supervision and background supervision, which facilitates enhanced extraction of collaborative saliency cues. Another notable difference is the integration of spatial and temporal features, wherein the temporal module is integrated into the multi-stream structure, enabling comprehensive spatial-temporal interactions within an end-to-end framework. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on five benchmark datasets while maintaining a real-time speed of 27 fps (Titan XP). Our code and models are available at https://github.com/RuJiaLe/MSTENet.
What causes object detection in video to be less accurate than it is in still images? Some video frames are degraded in appearance by fast movement, out-of-focus camera shots, and changes in posture. These problems have made video object detection (VID) a growing area of research in recent years. Video object detection can be used for various healthcare applications, such as detecting and tracking tumors in medical imaging, monitoring the movement of patients in hospitals and long-term care facilities, and analyzing videos of surgeries to improve technique and training. Additionally, it can be used in telemedicine to help diagnose and monitor patients remotely. Existing VID techniques rely on recurrent neural networks or optical flow for feature aggregation to produce reliable features that can be used for detection. Some of those methods aggregate features at the full-sequence level or from nearby frames. To create feature maps, existing VID techniques frequently use Convolutional Neural Networks (CNNs) as the backbone network. On the other hand, Vision Transformers have outperformed CNNs in various vision tasks, including object detection in still images and image classification. In this research we propose using Swin-Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based backbone networks for object detection in videos. The proposed architecture enhances the accuracy of existing VID methods. The ImageNet VID and EPIC KITCHENS datasets are used to evaluate the suggested methodology. We demonstrate that our proposed method is efficient by achieving 84.3% mean average precision (mAP) on ImageNet VID while using less memory than other leading VID techniques. The source code is available at https://github.com/amaharek/SwinVid.
Video processing is one challenge in collecting vehicle trajectories from unmanned aerial vehicles (UAVs), and road boundary estimation is one way to improve video processing algorithms. However, current methods do not work well for low-volume roads, which are not well marked and contain noise such as vehicle tracks. A fusion-based method termed Dempster-Shafer-based road detection (DSRD) is proposed to address this issue. This method detects the road boundary by combining multiple information sources using Dempster-Shafer theory (DST). To test the performance of the proposed method, two field experiments were conducted, one on a highway partially covered by snow and the other on a highway with dense traffic. The results show that DSRD is robust and accurate, with detection rates of 100% and 99.8% compared with manual detection results. DSRD is then adopted to improve the UAV video processing algorithm, and the vehicle detection and tracking rates are improved by 2.7% and 5.5%, respectively. The computation time also decreased by 5% and 8.3% for the two experiments, respectively.
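The fusion step named in the abstract above, Dempster-Shafer theory, combines evidence from independent sources with Dempster's rule of combination. The following is a minimal sketch of that rule only, not the DSRD method itself: the two cues (`color_cue`, `edge_cue`), their mass values, and the hypothesis names are hypothetical and do not come from the paper.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass; the masses
    in one function sum to 1. Products whose focal sets do not intersect
    count as conflict and are normalized away.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


ROAD = frozenset({"road"})
NON_ROAD = frozenset({"non_road"})
EITHER = ROAD | NON_ROAD  # mass left on the whole frame expresses ignorance

# Hypothetical evidence for a single pixel from two illustrative cues.
color_cue = {ROAD: 0.6, NON_ROAD: 0.1, EITHER: 0.3}
edge_cue = {ROAD: 0.5, NON_ROAD: 0.2, EITHER: 0.3}

fused = combine(color_cue, edge_cue)
print(fused)
```

In this example both cues lean toward "road", so the fused mass on "road" is higher than either cue assigns on its own, while the mass left on ignorance shrinks.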
A number of automated video shot boundary detection methods for indexing a video sequence to facilitate browsing and retrieval have been proposed in recent years. With these methods, the dissolve shot boundary is not accurately detected because it involves camera operation and object movement. In this paper, a method based on the support vector machine (SVM) is proposed to detect the dissolve shot boundary in MPEG compressed sequences. In our method, distinguishing the dissolve shot boundary from other boundaries is treated as a two-class classification problem. Features are extracted directly from the compressed sequences without decoding them, and the optimal class boundary between the two classes is learned from training data using the SVM. Experiments comparing various classification methods show that the proposed method improves the performance of video shot boundary detection.
Background: Video anomaly detection has long been a hot topic and has attracted increasing attention. Many existing methods for video anomaly detection process the entire video rather than considering only the significant context. Method: This paper proposes a novel video anomaly detection method called COVAD that focuses mainly on the region of interest in the video instead of the entire video. COVAD is based on an autoencoded convolutional neural network and a coordinated attention mechanism, which can effectively capture meaningful objects in the video and the dependencies among different objects. Relying on the existing memory-guided video frame prediction network, our algorithm can predict the future motion and appearance of objects in a video more effectively. Result: The proposed algorithm obtained better experimental results on multiple datasets and outperformed the baseline models considered in our analysis. We also provide an improved visual test that can provide pixel-level anomaly explanations.
In recent years, with the rapid development of deepfake technology, a large number of deepfake videos have emerged on the Internet, posing a serious threat to national politics, social stability, and personal privacy. Although many existing deepfake detection methods perform very well on known manipulations, their detection capabilities are weak when faced with unknown manipulations. Therefore, to obtain better generalization ability, this paper analyzes global and local inter-frame dynamic inconsistencies from the perspective of the spatial and frequency domains, and proposes a Local region Frequency Guided Dynamic Inconsistency Network (LFGDIN). The network includes two parts: a Global SpatioTemporal Network (GSTN) and a Local Region Frequency Guided Module (LRFGM). The GSTN is responsible for capturing the dynamic information of the entire face, while the LRFGM focuses on extracting the frequency dynamic information of the eyes and mouth. The LRFGM guides the GSTN to concentrate on dynamic inconsistency in significant local regions through local region alignment, so as to improve the model's detection performance. Experiments on three public datasets (FF++, DFDC, and Celeb-DF) show that, compared with many recent advanced methods, the proposed method achieves better results when detecting deepfake videos of unknown manipulation types.
In this paper, a video fire detection method is proposed that demonstrates good performance in indoor environments. Three main novel ideas are introduced. First, a flame color model in the RGB and HSI color spaces is used to extract pre-detected regions instead of the traditional motion differential method, as it is more suitable for fire detection in indoor environments. Second, based on the flicker characteristic of the flame, similarity and two main values of centroid motion are proposed. At the same time, a simple but effective method for tracking the same regions in consecutive frames is established. Third, a multi-expert system consisting of color component dispersion, similarity, and centroid motion is established to identify flames. The proposed method has been tested on a very large dataset of fire videos acquired both in real indoor environment tests and from the Internet. The experimental results show that the proposed approach achieves a balance between the false positive rate and the false negative rate, and performs better in terms of overall accuracy and F-score than other similar fire detection methods in indoor environments.
A real-time pedestrian detection and tracking system using a single video camera was developed to monitor pedestrians. The system contains six modules: video flow capture, pre-processing, movement detection, shadow removal, tracking, and object classification. The Gaussian mixture model was used to extract the moving object from an image sequence segmented by the mean-shift technique in the pre-processing module. Shadow removal was used to alleviate the negative impact of shadows on the detected objects. A model-free method was adopted to identify pedestrians. The maximum and minimum integration methods were developed to integrate multiple cues into the mean-shift algorithm and the initial tracking iteration with the competent integrated probability distribution map for object tracking. A simple but effective algorithm was proposed to handle full occlusion cases. The system was tested using real traffic videos from different sites. The test results confirm that the system is reliable and has an overall accuracy of over 85%.
Recently, a new research trend in the video salient object detection (VSOD) research community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation: a model learned on sparse frames possesses only weak generalization ability. This situation can become worse on "long" videos, since they tend to have intensive scene variations. Moreover, in such videos, the keyframe information from a longer time span is less relevant to the previous frames, which can also cause learning conflicts and degrade model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework that converts a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA) that effectively divides the mined frames into disjoint groups. Then, we assign an individual deep model to each group to capture its key attributes during the fine-tuning phase. For the testing phase, we design a model-matching strategy that dynamically selects the best-matched model from the fine-tuned ones to handle the given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtain robust saliency detection compared with the previous scheme and state-of-the-art methods.
Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information including location, cast, action, and audio, and if any of these elements is missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, we introduce a video classification and boundary detection method named JudPriNet. This paper focuses on objects in videos along with their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the images. As a significant contribution, JudPriNet presents a framework that maps labels to a "Continuous Bag of Visual Words Model" to cluster labels and generates new standardized labels as video-type tags. This facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method integrates video and textual components without compromising training and inference speed. Through experimentation, we demonstrate that JudPriNet, with its semantic connections, can effectively classify videos alongside textual content. Our results indicate that JudPriNet outperforms several other detection approaches in high-level content detection without disrupting the integrity of the video content.
While the internet has had many positive impacts on society, it also has negative components. Pornography, accessible to everyone through online platforms, is inducing psychological and health-related issues among people of all ages. Although it is a difficult task, detecting pornography can be an important step in identifying adult content in a video. In this paper, an architecture is proposed that yielded high scores for both training and testing. The dataset was produced from 190 videos, amounting to more than 19 hours of video. The main sources for the content were YouTube, movies, torrents, and websites that host both pornographic and non-pornographic content. The videos featured people of different ethnicities and skin colors, which ensures the models can detect any kind of video. VGG16, Inception V3, and ResNet-50 models were initially trained to detect these pornographic images but failed to achieve a high testing accuracy, with accuracies of 0.49, 0.49, and 0.78, respectively. Finally, utilizing transfer learning, a convolutional neural network was designed that yielded an accuracy of 0.98.
Content-based video copy detection is an active research field due to the need for copyright protection and business intellectual property protection. This paper presents a probabilistic spatiotemporal fusion approach for video copy detection. The approach directly estimates the location of the copy segment with a probabilistic graphical model. The spatial and temporal consistency of the video copy is embedded in the local probability function. An effective local descriptor and a two-level descriptor pairing method are used to build a video copy detection system to evaluate the approach. Tests show that it outperforms the popular voting algorithm and the probabilistic fusion framework based on the Hidden Markov Model, improving F-score (F1) by 8%.
This paper tackles the problem of video concept detection using a multi-modality fusion method. Motivated by multi-view learning algorithms, the multi-modality features of videos can be represented by multiple graphs, and graph-based semi-supervised learning methods can be extended to multiple graphs to predict the semantic labels of unlabeled video data. However, traditional graphs represent only homogeneous pairwise linking relations, and therefore the high-order correlations inherent in videos, such as high-order visual similarities, are ignored. In this paper we represent heterogeneous features by multiple hypergraphs, so that high-order correlated samples can be associated through hyperedges. Furthermore, the multi-hypergraph ranking (MHR) algorithm is proposed by defining a Markov random walk on each hypergraph and then forming mixture Markov chains so as to perform transductive learning in multiple hypergraphs. In experiments on the TRECVID dataset, a triple-hypergraph consisting of visual features, textual features, and multiple labeled tags is constructed to predict concept labels for unlabeled video shots with the MHR framework. Experimental results show that our approach is effective.
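To illustrate what "a Markov random walk on each hypergraph" followed by a mixture of chains can look like, here is a minimal sketch that assumes the commonly used hypergraph walk (pick an incident hyperedge in proportion to its weight, then a vertex of that hyperedge uniformly) and a fixed convex combination of the per-hypergraph chains. The example hyperedges, weights, and mixing coefficients are hypothetical, and the mixture used by the MHR algorithm itself may be defined differently.

```python
def hypergraph_transition(hyperedges, weights, n_vertices):
    """Row-stochastic transition matrix of a random walk on one hypergraph.

    From vertex u the walker picks an incident hyperedge e with probability
    proportional to its weight, then moves to a vertex of e chosen uniformly.
    Assumes every vertex belongs to at least one hyperedge.
    """
    degree = [0.0] * n_vertices  # weighted vertex degree
    for edge, w in zip(hyperedges, weights):
        for v in edge:
            degree[v] += w
    P = [[0.0] * n_vertices for _ in range(n_vertices)]
    for edge, w in zip(hyperedges, weights):
        for u in edge:
            for v in edge:
                P[u][v] += (w / degree[u]) / len(edge)
    return P


def mix_chains(matrices, coefficients):
    """Convex combination of transition matrices (coefficients sum to 1)."""
    n = len(matrices[0])
    return [
        [sum(c * P[i][j] for P, c in zip(matrices, coefficients)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: four video shots, two modalities.
visual = hypergraph_transition([{0, 1, 2}, {2, 3}], [1.0, 1.0], 4)
textual = hypergraph_transition([{0, 3}, {1, 2, 3}], [1.0, 2.0], 4)
mixed = mix_chains([visual, textual], [0.5, 0.5])

for row in mixed:
    print([round(p, 3) for p in row])
```

Because each per-hypergraph matrix is row-stochastic and the coefficients sum to 1, the mixed matrix is again a valid Markov chain over the shots.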
Content-based video copy detection has become an active research field due to the requirements of copyright protection, business intelligence, video retrieval, and so on. Although existing methods assume that the reference database consists of original videos, these videos are difficult to obtain in many practical cases. In this paper, a copy detection method based on sparse representation is proposed to make use of imperfect prototypes of original videos maintained in the reference database. A query video is represented as a linear combination of all the videos in the database. We can then determine whether the query has sibling videos in the database based on the distribution of coefficients, and find them based on the reconstruction error. The experiments show that this method can achieve high performance even with features of very limited dimension.
Object detection is one of the hottest research directions in computer vision; it has already made impressive progress in academia and has many valuable applications in industry. However, mainstream detection methods still have two shortcomings: (1) even a model that is well trained on large amounts of data still cannot generally be used across different kinds of scenes; (2) once a model is deployed, it cannot autonomously evolve along with the accumulated unlabeled scene data. To address these problems, and inspired by visual knowledge theory, we propose a novel scene-adaptive evolution unsupervised video object detection algorithm that can decrease the impact of scene changes through the concept of object groups. First, we extract a large number of object proposals from unlabeled data through a pre-trained detection model. Second, we build the visual knowledge dictionary of object concepts by clustering the proposals, in which each cluster center represents an object prototype. Third, we examine the relations between different clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of an object concept, which can effectively distinguish positive and negative proposals. With these pseudo labels, we can easily fine-tune the pre-trained model. The effectiveness of the proposed method is verified through different experiments, and significant improvements are achieved.
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, which limits effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue by considering a better mutual restraint scheme linking motion and appearance, allowing exploitation of cross-modal features from the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion) and compares well with leading methods for both video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Underwater robotic operation usually requires visual perception (e.g., object detection and tracking), but underwater scenes have poor visual quality and represent a special domain, which can affect the accuracy of visual perception. In addition, detection continuity and stability are important for robotic perception, but the commonly used static accuracy-based evaluation (i.e., average precision) is insufficient to reflect detector performance across time. In response to these two problems, we present a design for a novel robotic visual perception framework. First, we investigate the relationship between a quality-diverse data domain and visual restoration in detection performance. We find that, although domain quality has a negligible effect on within-domain detection accuracy, visual restoration is beneficial to detection in real sea scenarios because it reduces the domain shift. Moreover, non-reference assessments based on object tracklets are proposed for detection continuity and stability. Further, online tracklet refinement is developed to improve the temporal performance of detectors. Finally, combined with visual restoration, an accurate and stable underwater robotic visual perception framework is established. Small-overlap suppression is proposed to extend video object detection (VID) methods to a single-object tracking task, leading to the flexibility to switch between detection and tracking. Extensive experiments were conducted on the ImageNet VID dataset and real-world robotic tasks to verify the correctness of our analysis and the superiority of our proposed approaches. The code is available at https://github.com/yrqs/VisPerception.
Funding: Science and Technology Funds from the Liaoning Education Department (Serial Number: LJKZ0104).
Funding: Funded by the Natural Science Foundation China (NSFC) under Grant No. 62203192.
Funding: Project (2009AA11Z220) supported by the National High Technology Research and Development Program of China.
文摘Video processing is one challenge in collecting vehicle trajectories from unmanned aerial vehicle(UAV) and road boundary estimation is one way to improve the video processing algorithms. However, current methods do not work well for low volume road, which is not well-marked and with noises such as vehicle tracks. A fusion-based method termed Dempster-Shafer-based road detection(DSRD) is proposed to address this issue. This method detects road boundary by combining multiple information sources using Dempster-Shafer theory(DST). In order to test the performance of the proposed method, two field experiments were conducted, one of which was on a highway partially covered by snow and another was on a dense traffic highway. The results show that DSRD is robust and accurate, whose detection rates are 100% and 99.8% compared with manual detection results. Then, DSRD is adopted to improve UAV video processing algorithm, and the vehicle detection and tracking rate are improved by 2.7% and 5.5%,respectively. Also, the computation time has decreased by 5% and 8.3% for two experiments, respectively.
Abstract: A number of automated video shot boundary detection methods for indexing a video sequence to facilitate browsing and retrieval have been proposed in recent years. Among these methods, the dissolve shot boundary is not accurately detected because it involves camera operation and object movement. In this paper, a method based on the support vector machine (SVM) is proposed to detect the dissolve shot boundary in MPEG compressed sequences. In our method, distinguishing the dissolve shot boundary from other boundaries is treated as a two-class classification problem. Features are extracted directly from the compressed sequences without decoding them, and the optimal class boundary between the two classes is learned from training data using the SVM. Experiments comparing various classification methods show that the proposed method yields encouraging performance in video shot boundary detection.
Abstract: Background: Video anomaly detection has always been a hot topic and has attracted increasing attention. Many existing methods for video anomaly detection process the entire video rather than considering only the significant context. Method: This paper proposes a novel video anomaly detection method called COVAD that focuses mainly on the region of interest in the video instead of the entire video. The proposed COVAD method is based on an autoencoded convolutional neural network and a coordinated attention mechanism, which can effectively capture meaningful objects in the video and the dependencies among different objects. Building on an existing memory-guided video frame prediction network, our algorithm can predict the future motion and appearance of objects in a video more effectively. Result: The proposed algorithm obtained better experimental results on multiple datasets and outperformed the baseline models considered in our analysis. In addition, we provide an improved visual test that can give pixel-level anomaly explanations.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62072251 and U22B2062) and the Priority Academic Program Development of Jiangsu Higher Education Institutions fund.
Abstract: In recent years, with the rapid development of deepfake technology, a large number of deepfake videos have emerged on the Internet, posing a huge threat to national politics, social stability, and personal privacy. Although many existing deepfake detection methods perform very well on known manipulations, their detection capabilities are weak when faced with unknown manipulations. Therefore, to obtain better generalization ability, this paper analyzes global and local inter-frame dynamic inconsistencies from the perspective of the spatial and frequency domains, and proposes a Local Region Frequency Guided Dynamic Inconsistency Network (LFGDIN). The network includes two parts: a Global SpatioTemporal Network (GSTN) and a Local Region Frequency Guided Module (LRFGM). The GSTN is responsible for capturing the dynamic information of the entire face, while the LRFGM focuses on extracting the frequency dynamic information of the eyes and mouth. The LRFGM guides the GSTN to concentrate on dynamic inconsistency in some significant local regions through local region alignment, so as to improve the model's detection performance. Experiments on three public datasets (FF++, DFDC, and Celeb-DF) show that, compared with many recent advanced methods, the proposed method achieves better results when detecting deepfake videos of unknown manipulation types.
Funding: Supported by the National Natural Science Foundation of China (41471387, 41631072).
Abstract: In this paper, a video fire detection method is proposed that demonstrates good performance in indoor environments. Three main novel ideas are introduced. First, a flame color model in the RGB and HIS color spaces is used to extract pre-detected regions instead of the traditional motion differential method, as it is more suitable for fire detection in indoor environments. Second, according to the flicker characteristic of the flame, similarity and two main values of centroid motion are proposed. At the same time, a simple but effective method for tracking the same regions in consecutive frames is established. Third, a multi-expert system consisting of color component dispersion, similarity, and centroid motion is established to identify flames. The proposed method has been tested on a very large dataset of fire videos acquired both in real indoor environment tests and from the Internet. The experimental results show that the proposed approach achieves a balance between the false positive rate and the false negative rate, and performs better in terms of overall accuracy and F-measure than other similar fire detection methods in indoor environments.
Funding: Project (50778015) supported by the National Natural Science Foundation of China; Project (2012CB725403) supported by the Major State Basic Research Development Program of China.
Abstract: A real-time pedestrian detection and tracking system using a single video camera was developed to monitor pedestrians. The system contains six modules: video flow capture, pre-processing, movement detection, shadow removal, tracking, and object classification. The Gaussian mixture model was used to extract moving objects from an image sequence segmented by the mean-shift technique in the pre-processing module. Shadow removal was used to reduce the negative impact of shadows on the detected objects. A model-free method was adopted to identify pedestrians. Maximum and minimum integration methods were developed to integrate multiple cues into the mean-shift algorithm and the initial tracking iteration, using the resulting integrated probability distribution map for object tracking. A simple but effective algorithm was proposed to handle full occlusion cases. The system was tested using real traffic videos from different sites. The test results confirm that the system is reliable and has an overall accuracy of over 85%.
Funding: Supported in part by the CAMS Innovation Fund for Medical Sciences, China (No. 2019-I2M5-016), the National Natural Science Foundation of China (No. 62172246), the Youth Innovation and Technology Support Plan of Colleges and Universities in Shandong Province, China (No. 2021KJ062), and the National Science Foundation of USA (Nos. IIS-1715985 and IIS1812606).
Abstract: Recently, a new research trend in the video salient object detection (VSOD) community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation: a model learned on sparse frames possesses only weak generalization ability. This situation can become worse on "long" videos, since they tend to have intensive scene variations. Moreover, in such videos, keyframe information from a longer time span is less relevant to earlier frames, which can also cause learning conflicts and degrade model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework, which converts a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA) that effectively divides the mined frames into disjoint groups. Then, for each group, we assign an individual deep model to capture its key attributes during the fine-tuning phase. During the testing phase, we design a model-matching strategy that dynamically selects the best-matched model from the fine-tuned ones to handle the given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtain robust saliency detection compared with the previous scheme and state-of-the-art methods.
Abstract: Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information including location, cast, action, and audio, and if any of these elements is missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, we introduce a video classification and boundary detection method named JudPriNet. The focus of this paper is on objects in videos along with their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the images. As a significant contribution, JudPriNet presents a framework that maps labels to a "Continuous Bag of Visual Words Model" to cluster labels and generates new standardized labels as video-type tags. This facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method integrates video and textual components without compromising training and inference speed. Through experimentation, we demonstrate that JudPriNet, with its semantic connections, can effectively classify videos alongside textual content. Our results indicate that JudPriNet outperforms several other detection approaches in high-level content detection without disrupting the integrity of the video content.
Abstract: While the internet has had many positive impacts on society, it also has negative aspects. Pornography, accessible to everyone through online platforms, induces psychological and health-related issues among people of all ages. Although difficult, detecting pornography can be an important step in identifying adult content in a video. In this paper, an architecture is proposed that yielded high scores in both training and testing. The dataset was produced from 190 videos, yielding more than 19 h of footage. The main sources of content were YouTube, movies, torrents, and websites that host both pornographic and non-pornographic content. The videos featured people of different ethnicities and skin colors, so that the models can detect any kind of video. VGG16, Inception V3, and ResNet 50 models were initially trained to detect these pornographic images but failed to achieve high testing accuracy, with accuracies of 0.49, 0.49, and 0.78, respectively. Finally, using transfer learning, a convolutional neural network was designed that yielded an accuracy of 0.98.
Funding: Supported by the National Key Basic Research and Development (863) Program of China (No. 2007CB311003).
Abstract: Content-based video copy detection is an active research field due to the need for copyright protection and business intellectual property protection. This paper gives a probabilistic spatiotemporal fusion approach for video copy detection. This approach directly estimates the location of the copy segment with a probabilistic graphical model. The spatial and temporal consistency of the video copy is embedded in the local probability function. An effective local descriptor and a two-level descriptor pairing method are used to build a video copy detection system to evaluate the approach. Tests show that it outperforms the popular voting algorithm and the probabilistic fusion framework based on the Hidden Markov Model, improving F-score (F1) by 8%.
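For readers unfamiliar with the metric reported above, F1 is the harmonic mean of precision and recall. The sketch below computes it from detection counts; the counts are made-up illustrative numbers, not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2 * P * R / (P + R), with P = TP / (TP + FP) and R = TP / (TP + FN)."""
    if true_pos == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts of correctly detected, falsely detected, and missed copy segments.
print(round(f1_score(true_pos=90, false_pos=10, false_neg=30), 4))
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot score well on F1 by maximizing recall at the expense of precision, or the reverse.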
Funding: Supported by the National Natural Science Foundation of China (Nos. 60603096 and 60673088), the National High-Tech Research and Development Program (863) of China (No. 2006AA010107), and the Program for Changjiang Scholars and Innovative Research Team in University of China (No. IRT0652).
Abstract: This paper tackles the problem of video concept detection using a multi-modality fusion method. Motivated by multi-view learning algorithms, multi-modality features of videos can be represented by multiple graphs, and graph-based semi-supervised learning methods can be extended to multiple graphs to predict the semantic labels of unlabeled video data. However, traditional graphs represent only homogeneous pairwise linking relations, so the high-order correlations inherent in videos, such as high-order visual similarities, are ignored. In this paper, we represent heterogeneous features by multiple hypergraphs, so that high-order correlated samples can be associated through hyperedges. Furthermore, the multi-hypergraph ranking (MHR) algorithm is proposed by defining a Markov random walk on each hypergraph and then forming mixture Markov chains so as to perform transductive learning on multiple hypergraphs. In experiments on the TRECVID dataset, a triple-hypergraph consisting of visual features, textual features, and multiple labeled tags is constructed to predict concept labels for unlabeled video shots with the MHR framework. Experimental results show that our approach is effective.
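The random walk on a hypergraph mentioned above can be sketched as follows, using the standard hypergraph transition rule: from vertex u, pick an incident hyperedge with probability proportional to its weight, then pick a vertex uniformly within that hyperedge. The tiny incidence matrices, the unit edge weights, and the fixed-weight convex combination in `mix_chains` are illustrative assumptions; the abstract does not specify how MHR builds or mixes its chains.

```python
def hypergraph_transition(incidence, edge_weights):
    """Transition matrix of a random walk on a hypergraph.

    incidence[v][e] is 1 if vertex v belongs to hyperedge e, else 0.
    Assumes every vertex lies in at least one hyperedge with positive weight.
    """
    n_vertices, n_edges = len(incidence), len(edge_weights)
    # Vertex degree: total weight of the hyperedges containing the vertex.
    vertex_degree = [
        sum(edge_weights[e] * incidence[v][e] for e in range(n_edges))
        for v in range(n_vertices)
    ]
    # Edge degree: number of vertices in the hyperedge.
    edge_degree = [sum(incidence[v][e] for v in range(n_vertices)) for e in range(n_edges)]
    return [
        [
            sum(
                edge_weights[e] * incidence[u][e] * incidence[v][e]
                / (vertex_degree[u] * edge_degree[e])
                for e in range(n_edges)
            )
            for v in range(n_vertices)
        ]
        for u in range(n_vertices)
    ]


def mix_chains(chains, weights):
    """Convex combination of transition matrices (a simplifying assumption)."""
    n = len(chains[0])
    return [
        [sum(w * P[u][v] for P, w in zip(chains, weights)) for v in range(n)]
        for u in range(n)
    ]


# Four video shots; hypothetical hyperedges from two modalities.
visual = [[1, 0], [1, 0], [1, 1], [0, 1]]   # hyperedges {0, 1, 2} and {2, 3}
textual = [[1, 0], [0, 1], [0, 1], [1, 0]]  # hyperedges {0, 3} and {1, 2}

P_visual = hypergraph_transition(visual, [1.0, 1.0])
P_textual = hypergraph_transition(textual, [1.0, 1.0])
P_mixed = mix_chains([P_visual, P_textual], [0.5, 0.5])
```

Each row of the resulting matrices sums to 1, so the mixed matrix is still a valid Markov chain over the shots and can drive a ranking or label propagation step.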
Abstract: Content-based video copy detection has become an active research field due to the requirements of copyright protection, business intelligence, video retrieval, etc. Although existing methods assume that the reference database consists of original videos, these videos are difficult to obtain in many practical cases. In this paper, a copy detection method based on sparse representation is proposed to make use of imperfect prototypes of original videos maintained in the reference database. A query video is represented as a linear combination of all the videos in the database. We can then determine whether the query has sibling videos in the database based on the distribution of coefficients, and find them based on the reconstruction error. The experiments show that even with features of very limited dimension, this method can achieve high performance.
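The idea of representing a query as a sparse linear combination of database videos can be sketched with an l1-regularized least-squares fit solved by coordinate descent. The solver choice, the penalty `lam`, and the toy feature vectors are all assumptions made for illustration; the abstract does not state which sparse coding algorithm or features the authors use.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the proximal step of the l1 penalty)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_code(query, references, lam=0.05, sweeps=100):
    """Minimize 0.5 * ||query - sum_j c_j * references[j]||^2 + lam * ||c||_1."""
    coeffs = [0.0] * len(references)
    residual = list(query)
    norms = [sum(x * x for x in ref) for ref in references]
    for _ in range(sweeps):
        for j, ref in enumerate(references):
            if norms[j] == 0.0:
                continue
            # Correlation of reference j with the residual, excluding its own contribution.
            rho = sum(r * x for r, x in zip(residual, ref)) + norms[j] * coeffs[j]
            new_coeff = soft_threshold(rho, lam) / norms[j]
            delta = new_coeff - coeffs[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, ref)]
                coeffs[j] = new_coeff
    return coeffs, residual


# Hypothetical feature vectors for three database videos.
references = [[1.0, 0.0, 0.0, 0.0], [0.6, 0.8, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
# A query that is a slightly perturbed, scaled copy of database video 1.
query = [0.54, 0.72, 0.0, 0.02]

coeffs, residual = sparse_code(query, references)
best_match = max(range(len(coeffs)), key=lambda j: abs(coeffs[j]))
error = sum(r * r for r in residual) ** 0.5
print(best_match, [round(c, 3) for c in coeffs], round(error, 3))
```

In this toy case the coefficients concentrate on a single database entry and the reconstruction error is small, which mirrors the two signals the abstract describes: the distribution of coefficients and the reconstruction error.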
Funding: Project supported by the National Key R&D Program of China (No. 2020AAA010400X) and the Hikvision Open Fund, China.
Abstract: Object detection is one of the hottest research directions in computer vision, has already made impressive progress in academia, and has many valuable applications in industry. However, mainstream detection methods still have two shortcomings: (1) even a model that is well trained on large amounts of data generally cannot be used across different kinds of scenes; (2) once a model is deployed, it cannot autonomously evolve along with the accumulated unlabeled scene data. To address these problems, and inspired by visual knowledge theory, we propose a novel scene-adaptive evolution unsupervised video object detection algorithm that can decrease the impact of scene changes through the concept of object groups. We first extract a large number of object proposals from unlabeled data using a pre-trained detection model. Second, we build a visual knowledge dictionary of object concepts by clustering the proposals, in which each cluster center represents an object prototype. Third, we examine the relations between different clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of an object concept, which can effectively distinguish positive and negative proposals. With these pseudo labels, we can easily fine-tune the pre-trained model. The effectiveness of the proposed method is verified through different experiments, and significant improvements are achieved.
Funding: This work was supported by the National Natural Science Foundation of China (62176169, 61703077, and 62102207).
Abstract: Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue by considering a better mutual restraint scheme linking motion and appearance, allowing exploitation of cross-modal features from the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion), and compares well to leading methods for both video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61633004, 61725305, and 62073196) and the S&T Program of Hebei Province, China (No. F2020203037).
Abstract: Underwater robotic operation usually requires visual perception (e.g., object detection and tracking), but underwater scenes have poor visual quality and represent a special domain, which can affect the accuracy of visual perception. In addition, detection continuity and stability are important for robotic perception, but the commonly used evaluation based on static accuracy (i.e., average precision) is insufficient to reflect detector performance across time. In response to these two problems, we present a design for a novel robotic visual perception framework. First, we investigate the relationship between a quality-diverse data domain and visual restoration in terms of detection performance. We find that although domain quality has a negligible effect on within-domain detection accuracy, visual restoration benefits detection in real sea scenarios by reducing the domain shift. Moreover, non-reference assessments are proposed for detection continuity and stability based on object tracklets. Further, online tracklet refinement is developed to improve the temporal performance of detectors. Finally, combined with visual restoration, an accurate and stable underwater robotic visual perception framework is established. Small-overlap suppression is proposed to extend video object detection (VID) methods to a single-object tracking task, providing the flexibility to switch between detection and tracking. Extensive experiments were conducted on the ImageNet VID dataset and real-world robotic tasks to verify the correctness of our analysis and the superiority of our proposed approaches. The code is available at https://github.com/yrqs/VisPerception.
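The abstract does not define small-overlap suppression. One plausible reading, sketched below purely as an assumption, is that a detector can follow a single target by discarding detections whose overlap (IoU) with the target's previous box is small, and then keeping the highest-scoring survivor. The function name, the `min_iou` threshold, and the example boxes are hypothetical and are not taken from the authors' code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def small_overlap_suppression(prev_box, detections, min_iou=0.3):
    """Keep only detections that overlap the previous target box enough.

    detections is a list of (box, score) pairs for the current frame.
    Returns the best surviving (box, score), or None if the target is lost.
    The threshold value is an assumed, illustrative setting.
    """
    kept = [(box, score) for box, score in detections if iou(prev_box, box) >= min_iou]
    if not kept:
        return None
    return max(kept, key=lambda item: item[1])


prev_box = (10, 10, 50, 50)
detections = [
    ((12, 11, 52, 49), 0.70),      # same object, slightly moved
    ((200, 200, 240, 240), 0.95),  # a different, more confident object elsewhere
]
print(small_overlap_suppression(prev_box, detections))
```

Under this reading, the more confident but distant detection is suppressed, so the per-frame detector output behaves like a single-object tracker without a separate tracking model.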