The large-scale and sudden video content access such as flash crowds results in huge bandwidth demand,which severely influence user quality of experience and quality of service of video systems.In this paper,we firstl...The large-scale and sudden video content access such as flash crowds results in huge bandwidth demand,which severely influence user quality of experience and quality of service of video systems.In this paper,we firstly discuss the main reason of generation of flash crowds for video streaming services and analyze key factor for balance recovery between supply and demand of upload bandwidth.We construct two models:bandwidth supply capacity model of video systems and bandwidth demand model of users,which measures usage amount of bandwidth of the cloud.Based on the built models,we further employ a community-based cooperative caching strategy of video resources to promote supply capacity of upload bandwidth of video systems.Extensive tests show how the proposed cooperative caching strategy achieves much better performance results in comparison with original solution.展开更多
Ray-space based arbitrary viewpoint rendering without complex object segmentation or model construction is the main technology to realize Free Viewpoint Video(FVV) system for complex scenes. Ray-space interpolation an...Ray-space based arbitrary viewpoint rendering without complex object segmentation or model construction is the main technology to realize Free Viewpoint Video(FVV) system for complex scenes. Ray-space interpolation and compression are two key techniques for the solution. In this paper,correlation among multiple epipolar lines in ray-space data is analyzed,and a new method of ray-space interpolation with multi-epipolar lines matching is proposed. Comparing with the pixel-based matching interpolation method and the block-based matching interpolation method,the proposed method can achieve higher Peak Signal to Noise Ratio(PSNR) in interpolating rayspace data and rendering arbitrary viewpoint images.展开更多
In the realm of contemporary artificial intelligence,machine learning enables automation,allowing systems to naturally acquire and enhance their capabilities through learning.In this cycle,Video recommendation is fini...In the realm of contemporary artificial intelligence,machine learning enables automation,allowing systems to naturally acquire and enhance their capabilities through learning.In this cycle,Video recommendation is finished by utilizing machine learning strategies.A suggestion framework is an interaction of data sifting framework,which is utilized to foresee the“rating”or“inclination”given by the different clients.The expectation depends on past evaluations,history,interest,IMDB rating,and so on.This can be carried out by utilizing collective and substance-based separating approaches which utilize the data given by the different clients,examine them,and afterward suggest the video that suits the client at that specific time.The required datasets for the video are taken from Grouplens.This recommender framework is executed by utilizing Python Programming Language.For building this video recommender framework,two calculations are utilized,for example,K-implies Clustering and KNN grouping.K-implies is one of the unaided AI calculations and the fundamental goal is to bunch comparable sort of information focuses together and discover the examples.For that K-implies searches for a steady‘k'of bunches in a dataset.A group is an assortment of information focuses collected due to specific similitudes.K-Nearest Neighbor is an administered learning calculation utilized for characterization,with the given information;KNN can group new information by examination of the‘k'number of the closest information focuses.The last qualities acquired are through bunching qualities and root mean squared mistake,by using this algorithm we can recommend videos more appropriately based on user previous records and ratings.展开更多
Object detection plays a vital role in the video surveillance systems.To enhance security,surveillance cameras are now installed in public areas such as traffic signals,roadways,retail malls,train stations,and banks.Ho...Object detection plays a vital role in the video surveillance systems.To enhance security,surveillance cameras are now installed in public areas such as traffic signals,roadways,retail malls,train stations,and banks.However,monitor-ing the video continually at a quicker pace is a challenging job.As a consequence,security cameras are useless and need human monitoring.The primary difficulty with video surveillance is identifying abnormalities such as thefts,accidents,crimes,or other unlawful actions.The anomalous action does not occur at a high-er rate than usual occurrences.To detect the object in a video,first we analyze the images pixel by pixel.In digital image processing,segmentation is the process of segregating the individual image parts into pixels.The performance of segmenta-tion is affected by irregular illumination and/or low illumination.These factors highly affect the real-time object detection process in the video surveillance sys-tem.In this paper,a modified ResNet model(M-Resnet)is proposed to enhance the image which is affected by insufficient light.Experimental results provide the comparison of existing method output and modification architecture of the ResNet model shows the considerable amount improvement in detection objects in the video stream.The proposed model shows better results in the metrics like preci-sion,recall,pixel accuracy,etc.,andfinds a reasonable improvement in the object detection.展开更多
The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method in...The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method involves extracting structured data from video frames using facial landmark detection,which is then used as input to the CNN.The customized Convolutional Neural Network method is the date augmented-based CNN model to generate‘fake data’or‘fake images’.This study was carried out using Python and its libraries.We used 242 films from the dataset gathered by the Deep Fake Detection Challenge,of which 199 were made up and the remaining 53 were real.Ten seconds were allotted for each video.There were 318 videos used in all,199 of which were fake and 119 of which were real.Our proposedmethod achieved a testing accuracy of 91.47%,loss of 0.342,and AUC score of 0.92,outperforming two alternative approaches,CNN and MLP-CNN.Furthermore,our method succeeded in greater accuracy than contemporary models such as XceptionNet,Meso-4,EfficientNet-BO,MesoInception-4,VGG-16,and DST-Net.The novelty of this investigation is the development of a new Convolutional Neural Network(CNN)learning model that can accurately detect deep fake face photos.展开更多
Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions i...Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.展开更多
The mixed reality conference system proposed in this paper is a robust,real-time video conference application software that makes up for the simple interaction and lack of immersion and realism of traditional video co...The mixed reality conference system proposed in this paper is a robust,real-time video conference application software that makes up for the simple interaction and lack of immersion and realism of traditional video conference,which realizes the entire process of holographic video conference from client to cloud to the client.This paper mainly focuses on designing and implementing a video conference system based on AI segmentation technology and mixed reality.Several mixed reality conference system components are discussed,including data collection,data transmission,processing,and mixed reality presentation.The data layer is mainly used for data collection,integration,and video and audio codecs.The network layer uses Web-RTC to realize peer-to-peer data communication.The data processing layer is the core part of the system,mainly for human video matting and human-computer interaction,which is the key to realizing mixed reality conferences and improving the interactive experience.The presentation layer explicitly includes the login interface of the mixed reality conference system,the presentation of real-time matting of human subjects,and the presentation objects.With the mixed reality conference system,conference participants in different places can see each other in real-time in their mixed reality scene and share presentation content and 3D models based on mixed reality technology to have a more interactive and immersive experience.展开更多
The Learning management system(LMS)is now being used for uploading educational content in both distance and blended setups.LMS platform has two types of users:the educators who upload the content,and the students who ...The Learning management system(LMS)is now being used for uploading educational content in both distance and blended setups.LMS platform has two types of users:the educators who upload the content,and the students who have to access the content.The students,usually rely on text notes or books and video tutorials while their exams are conducted with formal methods.Formal assessments and examination criteria are ineffective with restricted learning space which makes the student tend only to read the educational contents and videos instead of interactive mode.The aim is to design an interactive LMS and examination video-based interface to cater the issues of educators and students.It is designed according to Human-computer interaction(HCI)principles to make the interactive User interface(UI)through User experience(UX).The interactive lectures in the form of annotated videos increase user engagement and improve the self-study context of users involved in LMS.The interface design defines how the design will interact with users and how the interface exchanges information.The findings show that interactive videos for LMS allow the users to have a more personalized learning experience by engaging in the educational content.The result shows a highly personalized learning experience due to the interactive video and quiz within the video.展开更多
Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate b...Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate based on facial video is an exciting research field for getting palpation information by observation diagnosis.However,most studies focus on optimizing the algorithm based on a small sample of participants without systematically investigating multiple influencing factors.A total of 209 participants and 2,435 facial videos,based on our self-constructed Multi-Scene Sign Dataset and the public datasets,were used to perform a multi-level and multi-factor comprehensive comparison.The effects of different datasets,blood volume pulse signal extraction algorithms,region of interests,time windows,color spaces,pulse rate calculation methods,and video recording scenes were analyzed.Furthermore,we proposed a blood volume pulse signal quality optimization strategy based on the inverse Fourier transform and an improvement strategy for pulse rate estimation based on signal-to-noise ratio threshold sliding.We found that the effects of video estimation of pulse rate in the Multi-Scene Sign Dataset and Pulse Rate Detection Dataset were better than in other datasets.Compared with Fast independent component analysis and Single Channel algorithms,chrominance-based method and plane-orthogonal-to-skin algorithms have a more vital anti-interference ability and higher robustness.The performances of the five-organs fusion area and the full-face area were better than that of single sub-regions,and the fewer motion artifacts and better lighting can improve the precision of pulse rate estimation.展开更多
Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing s...Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing significant challenges.In response,various protection approaches have been developed to secure,authenticate,and ensure the integrity of digital videos.This study provides a comprehensive survey of the challenges associated with maintaining the confidentiality,integrity,and availability of video content,and examining how it can be manipulated.It then investigates current developments in the field of video security by exploring two critical research questions.First,it examine the techniques used by adversaries to compromise video data and evaluate their impact.Understanding these attack methodologies is crucial for developing effective defense mechanisms.Second,it explores the various security approaches that can be employed to protect video data,enhancing its transparency,integrity,and trustworthiness.It compares the effectiveness of these approaches across different use cases,including surveillance,video on demand(VoD),and medical videos related to disease diagnostics.Finally,it identifies potential research opportunities to enhance video data protection in response to the evolving threat landscape.Through this investigation,this study aims to contribute to the ongoing efforts in securing video data,providing insights that are vital for researchers,practitioners,and policymakers dedicated to enhancing the safety and reliability of video content in our digital world.展开更多
Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It...Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It is also playing an essential role in devolving human-robot interaction.The dense video description is more difficult when compared with simple Video captioning because of the object’s interactions and event overlapping.Deep learning is changing the shape of computer vision(CV)technologies and natural language processing(NLP).There are hundreds of deep learning models,datasets,and evaluations that can improve the gaps in current research.This article filled this gap by evaluating some state-of-the-art approaches,especially focusing on deep learning and machine learning for video caption in a dense environment.In this article,some classic techniques concerning the existing machine learning were reviewed.And provides deep learning models,a detail of benchmark datasets with their respective domains.This paper reviews various evaluation metrics,including Bilingual EvaluationUnderstudy(BLEU),Metric for Evaluation of Translation with Explicit Ordering(METEOR),WordMover’s Distance(WMD),and Recall-Oriented Understudy for Gisting Evaluation(ROUGE)with their pros and cons.Finally,this article listed some future directions and proposed work for context enhancement using key scene extraction with object detection in a particular frame.Especially,how to improve the context of video description by analyzing key frames detection through morphological image analysis.Additionally,the paper discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection,which incorporates the fusion of large languagemodels for refining results.The ultimate results arise fromenhancing the generated text of the proposedmodel by improving the predicted text and isolating objects using various keyframes.These keyframes identify dense events occurring in the video sequence.展开更多
Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detectio...Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detection performance,this paper proposes a steganalysis method that can perfectly detectMV-based steganography in HEVC.Firstly,we define the local optimality of MVP(Motion Vector Prediction)based on the technology of AMVP(Advanced Motion Vector Prediction).Secondly,we analyze that in HEVC video,message embedding either usingMVP index orMVD(Motion Vector Difference)may destroy the above optimality of MVP.And then,we define the optimal rate of MVP as a steganalysis feature.Finally,we conduct steganalysis detection experiments on two general datasets for three popular steganographymethods and compare the performance with four state-ofthe-art steganalysis methods.The experimental results demonstrate the effectiveness of the proposed feature set.Furthermore,our method stands out for its practical applicability,requiring no model training and exhibiting low computational complexity,making it a viable solution for real-world scenarios.展开更多
Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the...Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the technical and business issues surrounding cloudbased live streaming is provided,including the benefits of cloud computing,the various live streaming architectures,and the challenges that live streaming service providers face in delivering high‐quality,real‐time services.The different techniques used to improve the performance of video streaming,such as adaptive bit‐rate streaming,multicast distribution,and edge computing are discussed and the necessity of low‐latency and high‐quality video transmission in cloud‐based live streaming is underlined.Issues such as improving user experience and live streaming service performance using cutting‐edge technology,like artificial intelligence and machine learning are discussed.In addition,the legal and regulatory implications of cloud‐based live streaming,including issues with network neutrality,data privacy,and content moderation are addressed.The future of cloud computing for live streaming is examined in the section that follows,and it looks at the most likely new developments in terms of trends and technology.For technology vendors,live streaming service providers,and regulators,the findings have major policy‐relevant implications.Suggestions on how stakeholders should address these concerns and take advantage of the potential presented by this rapidly evolving sector,as well as insights into the key challenges and opportunities associated with cloud‐based live streaming are provided.展开更多
Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract ...Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract of the entire video that includes the most representative frames is known as static video summarization.This method resulted in rapid exploration,indexing,and retrieval of massive video libraries.We propose a framework for static video summary based on a Binary Robust Invariant Scalable Keypoint(BRISK)and bisecting K-means clustering algorithm.The current method effectively recognizes relevant frames using BRISK by extracting keypoints and the descriptors from video sequences.The video frames’BRISK features are clustered using a bisecting K-means,and the keyframe is determined by selecting the frame that is most near the cluster center.Without applying any clustering parameters,the appropriate clusters number is determined using the silhouette coefficient.Experiments were carried out on a publicly available open video project(OVP)dataset that contained videos of different genres.The proposed method’s effectiveness is compared to existing methods using a variety of evaluation metrics,and the proposed method achieves a trade-off between computational cost and quality.展开更多
In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is de...In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is dependent on a single video input source and few visual labels,and there is a problem with semantic alignment between video contents and generated natural sentences,which are not suitable for accurately comprehending and describing the video contents.To address this issue,this paper proposes a video captioning method by semantic topic-guided generation.First,a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during the encoding.Then,the semantic topics of video data are extracted using the visual labels retrieved from similar video data.In the decoding,a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network,which decreases the influence of“deviation”in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents.During this process,the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words.Finally,the experiments are conducted on two publicly used Microsoft Research Video Description andMicrosoft Research-Video to Text datasets.The experimental results demonstrate that the proposed method outperforms several state-of-art approaches.Specifically,the performance indicators Bilingual Evaluation Understudy,Metric for Evaluation of Translation with Explicit Ordering,Recall Oriented Understudy for Gisting Evaluation-longest common subsequence,and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%,0.1%,0.3%,and 2.4% on the Microsoft Research Video Description dataset,and 0.1%,1.0%,0.1%,and 2.8% on the Microsoft Research-Video to Text dataset,respectively,compared with the existing video captioning methods.As a result,the proposed method can generate video captioning that is more closely aligned with human natural language expression habits.展开更多
Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications indu...Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications industry loses millions of dollars due to poor video Quality of Experience(QoE)for users.Among the standard proposals for standardizing the quality of video streaming over internet service providers(ISPs)is the Mean Opinion Score(MOS).However,the accurate finding of QoE by MOS is subjective and laborious,and it varies depending on the user.A fully automated data analytics framework is required to reduce the inter-operator variability characteristic in QoE assessment.This work addresses this concern by suggesting a novel hybrid XGBStackQoE analytical model using a two-level layering technique.Level one combines multiple Machine Learning(ML)models via a layer one Hybrid XGBStackQoE-model.Individual ML models at level one are trained using the entire training data set.The level two Hybrid XGBStackQoE-Model is fitted using the outputs(meta-features)of the layer one ML models.The proposed model outperformed the conventional models,with an accuracy improvement of 4 to 5 percent,which is still higher than the current traditional models.The proposed framework could significantly improve video QoE accuracy.展开更多
Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes an...Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes and their neighbors,but ignore the dynamic dependencies between nodes.To address this challenge,we propose an innovative Adaptive Graph Convolutional Adjacency Matrix Network(TAMGCN),leveraging the attention mechanism to dynamically adjust dependencies between graph nodes.Specifically,we first segment shots and extract features of each frame,then compute the representative features of each shot.Subsequently,we utilize the attention mechanism to dynamically adjust the adjacency matrix of the graph convolutional network to better capture the dynamic dependencies between graph nodes.Finally,we fuse temporal features extracted by Bi-directional Long Short-Term Memory network with structural features extracted by the graph convolutional network to generate high-quality summaries.Extensive experiments are conducted on two benchmark datasets,TVSum and SumMe,yielding F1-scores of 60.8%and 53.2%,respectively.Experimental results demonstrate that our method outperforms most state-of-the-art video summarization techniques.展开更多
Video watermarking plays a crucial role in protecting intellectual property rights and ensuring content authenticity.This study delves into the integration of Galois Field(GF)multiplication tables,especially GF(2^(4))...Video watermarking plays a crucial role in protecting intellectual property rights and ensuring content authenticity.This study delves into the integration of Galois Field(GF)multiplication tables,especially GF(2^(4)),and their interaction with distinct irreducible polynomials.The primary aim is to enhance watermarking techniques for achieving imperceptibility,robustness,and efficient execution time.The research employs scene selection and adaptive thresholding techniques to streamline the watermarking process.Scene selection is used strategically to embed watermarks in the most vital frames of the video,while adaptive thresholding methods ensure that the watermarking process adheres to imperceptibility criteria,maintaining the video's visual quality.Concurrently,careful consideration is given to execution time,crucial in real-world scenarios,to balance efficiency and efficacy.The Peak Signal-to-Noise Ratio(PSNR)serves as a pivotal metric to gauge the watermark's imperceptibility and video quality.The study explores various irreducible polynomials,navigating the trade-offs between computational efficiency and watermark imperceptibility.In parallel,the study pays careful attention to the execution time,a paramount consideration in real-world scenarios,to strike a balance between efficiency and efficacy.This comprehensive analysis provides valuable insights into the interplay of GF multiplication tables,diverse irreducible polynomials,scene selection,adaptive thresholding,imperceptibility,and execution time.The evaluation of the proposed algorithm's robustness was conducted using PSNR and NC metrics,and it was subjected to assessment under the impact of five distinct attack scenarios.These findings contribute to the development of watermarking strategies that balance imperceptibility,robustness,and processing efficiency,enhancing the field's practicality and effectiveness.展开更多
Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit metho...Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit methods exist for accurately embedding ownership or copyright information in video data,the nascent NeRV framework has yet to address this issue comprehensively.In response,this paper introduces MarkINeRV,a scheme designed to embed watermarking information into video frames using an invertible neural network watermarking approach to protect the copyright of NeRV,which models the embedding and extraction of watermarks as a pair of inverse processes of a reversible network and employs the same network to achieve embedding and extraction of watermarks.It is just that the information flow is in the opposite direction.Additionally,a video frame quality enhancement module is incorporated to mitigate watermarking information losses in the rendering process and the possibility ofmalicious attacks during transmission,ensuring the accurate extraction of watermarking information through the invertible network’s inverse process.This paper evaluates the accuracy,robustness,and invisibility of MarkINeRV through multiple video datasets.The results demonstrate its efficacy in extracting watermarking information for copyright protection of NeRV.MarkINeRV represents a pioneering investigation into copyright issues surrounding NeRV.展开更多
基金supported in part by the National Natural Science Foundation of China(NSFC) under Grant Nos.61501216,61402303,61522103the Science and Technology Plan Projects(Openness & Cooperation) of Henan province(152106000048)
文摘The large-scale and sudden video content access such as flash crowds results in huge bandwidth demand,which severely influence user quality of experience and quality of service of video systems.In this paper,we firstly discuss the main reason of generation of flash crowds for video streaming services and analyze key factor for balance recovery between supply and demand of upload bandwidth.We construct two models:bandwidth supply capacity model of video systems and bandwidth demand model of users,which measures usage amount of bandwidth of the cloud.Based on the built models,we further employ a community-based cooperative caching strategy of video resources to promote supply capacity of upload bandwidth of video systems.Extensive tests show how the proposed cooperative caching strategy achieves much better performance results in comparison with original solution.
基金the National Natural Science Foundation of China (No.60472100)the Natural Science Foundation of Zhejiang Province (No.Y105577)the Key Project of Chinese Ministry of Education (No.206059).
文摘Ray-space based arbitrary viewpoint rendering without complex object segmentation or model construction is the main technology to realize Free Viewpoint Video(FVV) system for complex scenes. Ray-space interpolation and compression are two key techniques for the solution. In this paper,correlation among multiple epipolar lines in ray-space data is analyzed,and a new method of ray-space interpolation with multi-epipolar lines matching is proposed. Comparing with the pixel-based matching interpolation method and the block-based matching interpolation method,the proposed method can achieve higher Peak Signal to Noise Ratio(PSNR) in interpolating rayspace data and rendering arbitrary viewpoint images.
文摘In the realm of contemporary artificial intelligence,machine learning enables automation,allowing systems to naturally acquire and enhance their capabilities through learning.In this cycle,Video recommendation is finished by utilizing machine learning strategies.A suggestion framework is an interaction of data sifting framework,which is utilized to foresee the“rating”or“inclination”given by the different clients.The expectation depends on past evaluations,history,interest,IMDB rating,and so on.This can be carried out by utilizing collective and substance-based separating approaches which utilize the data given by the different clients,examine them,and afterward suggest the video that suits the client at that specific time.The required datasets for the video are taken from Grouplens.This recommender framework is executed by utilizing Python Programming Language.For building this video recommender framework,two calculations are utilized,for example,K-implies Clustering and KNN grouping.K-implies is one of the unaided AI calculations and the fundamental goal is to bunch comparable sort of information focuses together and discover the examples.For that K-implies searches for a steady‘k'of bunches in a dataset.A group is an assortment of information focuses collected due to specific similitudes.K-Nearest Neighbor is an administered learning calculation utilized for characterization,with the given information;KNN can group new information by examination of the‘k'number of the closest information focuses.The last qualities acquired are through bunching qualities and root mean squared mistake,by using this algorithm we can recommend videos more appropriately based on user previous records and ratings.
文摘Object detection plays a vital role in the video surveillance systems.To enhance security,surveillance cameras are now installed in public areas such as traffic signals,roadways,retail malls,train stations,and banks.However,monitor-ing the video continually at a quicker pace is a challenging job.As a consequence,security cameras are useless and need human monitoring.The primary difficulty with video surveillance is identifying abnormalities such as thefts,accidents,crimes,or other unlawful actions.The anomalous action does not occur at a high-er rate than usual occurrences.To detect the object in a video,first we analyze the images pixel by pixel.In digital image processing,segmentation is the process of segregating the individual image parts into pixels.The performance of segmenta-tion is affected by irregular illumination and/or low illumination.These factors highly affect the real-time object detection process in the video surveillance sys-tem.In this paper,a modified ResNet model(M-Resnet)is proposed to enhance the image which is affected by insufficient light.Experimental results provide the comparison of existing method output and modification architecture of the ResNet model shows the considerable amount improvement in detection objects in the video stream.The proposed model shows better results in the metrics like preci-sion,recall,pixel accuracy,etc.,andfinds a reasonable improvement in the object detection.
基金Science and Technology Funds from the Liaoning Education Department(Serial Number:LJKZ0104).
文摘The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method involves extracting structured data from video frames using facial landmark detection,which is then used as input to the CNN.The customized Convolutional Neural Network method is the date augmented-based CNN model to generate‘fake data’or‘fake images’.This study was carried out using Python and its libraries.We used 242 films from the dataset gathered by the Deep Fake Detection Challenge,of which 199 were made up and the remaining 53 were real.Ten seconds were allotted for each video.There were 318 videos used in all,199 of which were fake and 119 of which were real.Our proposedmethod achieved a testing accuracy of 91.47%,loss of 0.342,and AUC score of 0.92,outperforming two alternative approaches,CNN and MLP-CNN.Furthermore,our method succeeded in greater accuracy than contemporary models such as XceptionNet,Meso-4,EfficientNet-BO,MesoInception-4,VGG-16,and DST-Net.The novelty of this investigation is the development of a new Convolutional Neural Network(CNN)learning model that can accurately detect deep fake face photos.
文摘Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
基金supported in part by the Major Fundamental Research of Natural Science Foundation of Shandong Province under Grant ZR2019ZD05Joint fund for smart computing of Shandong Natural Science Foundation under Grant ZR2020LZH013+1 种基金Open project of State Key Laboratory of Computer Architecture CARCHA202002Human Video Matting Project of Hisense Co.,Ltd.under Grant QD1170020023.
文摘The mixed reality conference system proposed in this paper is a robust,real-time video conference application software that makes up for the simple interaction and lack of immersion and realism of traditional video conference,which realizes the entire process of holographic video conference from client to cloud to the client.This paper mainly focuses on designing and implementing a video conference system based on AI segmentation technology and mixed reality.Several mixed reality conference system components are discussed,including data collection,data transmission,processing,and mixed reality presentation.The data layer is mainly used for data collection,integration,and video and audio codecs.The network layer uses Web-RTC to realize peer-to-peer data communication.The data processing layer is the core part of the system,mainly for human video matting and human-computer interaction,which is the key to realizing mixed reality conferences and improving the interactive experience.The presentation layer explicitly includes the login interface of the mixed reality conference system,the presentation of real-time matting of human subjects,and the presentation objects.With the mixed reality conference system,conference participants in different places can see each other in real-time in their mixed reality scene and share presentation content and 3D models based on mixed reality technology to have a more interactive and immersive experience.
文摘The Learning management system(LMS)is now being used for uploading educational content in both distance and blended setups.LMS platform has two types of users:the educators who upload the content,and the students who have to access the content.The students,usually rely on text notes or books and video tutorials while their exams are conducted with formal methods.Formal assessments and examination criteria are ineffective with restricted learning space which makes the student tend only to read the educational contents and videos instead of interactive mode.The aim is to design an interactive LMS and examination video-based interface to cater the issues of educators and students.It is designed according to Human-computer interaction(HCI)principles to make the interactive User interface(UI)through User experience(UX).The interactive lectures in the form of annotated videos increase user engagement and improve the self-study context of users involved in LMS.The interface design defines how the design will interact with users and how the interface exchanges information.The findings show that interactive videos for LMS allow the users to have a more personalized learning experience by engaging in the educational content.The result shows a highly personalized learning experience due to the interactive video and quiz within the video.
基金supported by the Key Research Program of the Chinese Academy of Sciences(grant number ZDRW-ZS-2021-1-2).
文摘Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate based on facial video is an exciting research field for getting palpation information by observation diagnosis.However,most studies focus on optimizing the algorithm based on a small sample of participants without systematically investigating multiple influencing factors.A total of 209 participants and 2,435 facial videos,based on our self-constructed Multi-Scene Sign Dataset and the public datasets,were used to perform a multi-level and multi-factor comprehensive comparison.The effects of different datasets,blood volume pulse signal extraction algorithms,region of interests,time windows,color spaces,pulse rate calculation methods,and video recording scenes were analyzed.Furthermore,we proposed a blood volume pulse signal quality optimization strategy based on the inverse Fourier transform and an improvement strategy for pulse rate estimation based on signal-to-noise ratio threshold sliding.We found that the effects of video estimation of pulse rate in the Multi-Scene Sign Dataset and Pulse Rate Detection Dataset were better than in other datasets.Compared with Fast independent component analysis and Single Channel algorithms,chrominance-based method and plane-orthogonal-to-skin algorithms have a more vital anti-interference ability and higher robustness.The performances of the five-organs fusion area and the full-face area were better than that of single sub-regions,and the fewer motion artifacts and better lighting can improve the precision of pulse rate estimation.
基金funded by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Action(MSCA)grant agreement No.101109961.
文摘Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing significant challenges.In response,various protection approaches have been developed to secure,authenticate,and ensure the integrity of digital videos.This study provides a comprehensive survey of the challenges associated with maintaining the confidentiality,integrity,and availability of video content,and examining how it can be manipulated.It then investigates current developments in the field of video security by exploring two critical research questions.First,it examine the techniques used by adversaries to compromise video data and evaluate their impact.Understanding these attack methodologies is crucial for developing effective defense mechanisms.Second,it explores the various security approaches that can be employed to protect video data,enhancing its transparency,integrity,and trustworthiness.It compares the effectiveness of these approaches across different use cases,including surveillance,video on demand(VoD),and medical videos related to disease diagnostics.Finally,it identifies potential research opportunities to enhance video data protection in response to the evolving threat landscape.Through this investigation,this study aims to contribute to the ongoing efforts in securing video data,providing insights that are vital for researchers,practitioners,and policymakers dedicated to enhancing the safety and reliability of video content in our digital world.
文摘Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It is also playing an essential role in devolving human-robot interaction.The dense video description is more difficult when compared with simple Video captioning because of the object’s interactions and event overlapping.Deep learning is changing the shape of computer vision(CV)technologies and natural language processing(NLP).There are hundreds of deep learning models,datasets,and evaluations that can improve the gaps in current research.This article filled this gap by evaluating some state-of-the-art approaches,especially focusing on deep learning and machine learning for video caption in a dense environment.In this article,some classic techniques concerning the existing machine learning were reviewed.And provides deep learning models,a detail of benchmark datasets with their respective domains.This paper reviews various evaluation metrics,including Bilingual EvaluationUnderstudy(BLEU),Metric for Evaluation of Translation with Explicit Ordering(METEOR),WordMover’s Distance(WMD),and Recall-Oriented Understudy for Gisting Evaluation(ROUGE)with their pros and cons.Finally,this article listed some future directions and proposed work for context enhancement using key scene extraction with object detection in a particular frame.Especially,how to improve the context of video description by analyzing key frames detection through morphological image analysis.Additionally,the paper discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection,which incorporates the fusion of large languagemodels for refining results.The ultimate results arise fromenhancing the generated text of the proposedmodel by improving the predicted text and isolating objects using various keyframes.These keyframes identify dense events occurring in the video sequence.
基金the National Natural Science Foundation of China(Grant Nos.62272478,62202496,61872384).
文摘Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detection performance,this paper proposes a steganalysis method that can perfectly detectMV-based steganography in HEVC.Firstly,we define the local optimality of MVP(Motion Vector Prediction)based on the technology of AMVP(Advanced Motion Vector Prediction).Secondly,we analyze that in HEVC video,message embedding either usingMVP index orMVD(Motion Vector Difference)may destroy the above optimality of MVP.And then,we define the optimal rate of MVP as a steganalysis feature.Finally,we conduct steganalysis detection experiments on two general datasets for three popular steganographymethods and compare the performance with four state-ofthe-art steganalysis methods.The experimental results demonstrate the effectiveness of the proposed feature set.Furthermore,our method stands out for its practical applicability,requiring no model training and exhibiting low computational complexity,making it a viable solution for real-world scenarios.
文摘Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the technical and business issues surrounding cloudbased live streaming is provided,including the benefits of cloud computing,the various live streaming architectures,and the challenges that live streaming service providers face in delivering high‐quality,real‐time services.The different techniques used to improve the performance of video streaming,such as adaptive bit‐rate streaming,multicast distribution,and edge computing are discussed and the necessity of low‐latency and high‐quality video transmission in cloud‐based live streaming is underlined.Issues such as improving user experience and live streaming service performance using cutting‐edge technology,like artificial intelligence and machine learning are discussed.In addition,the legal and regulatory implications of cloud‐based live streaming,including issues with network neutrality,data privacy,and content moderation are addressed.The future of cloud computing for live streaming is examined in the section that follows,and it looks at the most likely new developments in terms of trends and technology.For technology vendors,live streaming service providers,and regulators,the findings have major policy‐relevant implications.Suggestions on how stakeholders should address these concerns and take advantage of the potential presented by this rapidly evolving sector,as well as insights into the key challenges and opportunities associated with cloud‐based live streaming are provided.
基金The authors would like to thank Research Supporting Project Number(RSP2024R444)King Saud University,Riyadh,Saudi Arabia.
文摘Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract of the entire video that includes the most representative frames is known as static video summarization.This method resulted in rapid exploration,indexing,and retrieval of massive video libraries.We propose a framework for static video summary based on a Binary Robust Invariant Scalable Keypoint(BRISK)and bisecting K-means clustering algorithm.The current method effectively recognizes relevant frames using BRISK by extracting keypoints and the descriptors from video sequences.The video frames’BRISK features are clustered using a bisecting K-means,and the keyframe is determined by selecting the frame that is most near the cluster center.Without applying any clustering parameters,the appropriate clusters number is determined using the silhouette coefficient.Experiments were carried out on a publicly available open video project(OVP)dataset that contained videos of different genres.The proposed method’s effectiveness is compared to existing methods using a variety of evaluation metrics,and the proposed method achieves a trade-off between computational cost and quality.
基金supported in part by the National Natural Science Foundation of China under Grant 61873277in part by the Natural Science Basic Research Plan in Shaanxi Province of China underGrant 2020JQ-758in part by the Chinese Postdoctoral Science Foundation under Grant 2020M673446.
文摘In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is dependent on a single video input source and few visual labels,and there is a problem with semantic alignment between video contents and generated natural sentences,which are not suitable for accurately comprehending and describing the video contents.To address this issue,this paper proposes a video captioning method by semantic topic-guided generation.First,a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during the encoding.Then,the semantic topics of video data are extracted using the visual labels retrieved from similar video data.In the decoding,a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network,which decreases the influence of“deviation”in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents.During this process,the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words.Finally,the experiments are conducted on two publicly used Microsoft Research Video Description andMicrosoft Research-Video to Text datasets.The experimental results demonstrate that the proposed method outperforms several state-of-art approaches.Specifically,the performance indicators Bilingual Evaluation Understudy,Metric for Evaluation of Translation with Explicit Ordering,Recall Oriented Understudy for Gisting Evaluation-longest common subsequence,and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%,0.1%,0.3%,and 2.4% on the Microsoft Research Video Description dataset,and 0.1%,1.0%,0.1%,and 2.8% on the Microsoft Research-Video to Text dataset,respectively,compared with the existing video captioning methods.As a result,the proposed method can generate video captioning that is more closely aligned with human natural language expression habits.
文摘Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications industry loses millions of dollars due to poor video Quality of Experience(QoE)for users.Among the standard proposals for standardizing the quality of video streaming over internet service providers(ISPs)is the Mean Opinion Score(MOS).However,the accurate finding of QoE by MOS is subjective and laborious,and it varies depending on the user.A fully automated data analytics framework is required to reduce the inter-operator variability characteristic in QoE assessment.This work addresses this concern by suggesting a novel hybrid XGBStackQoE analytical model using a two-level layering technique.Level one combines multiple Machine Learning(ML)models via a layer one Hybrid XGBStackQoE-model.Individual ML models at level one are trained using the entire training data set.The level two Hybrid XGBStackQoE-Model is fitted using the outputs(meta-features)of the layer one ML models.The proposed model outperformed the conventional models,with an accuracy improvement of 4 to 5 percent,which is still higher than the current traditional models.The proposed framework could significantly improve video QoE accuracy.
基金This work was supported by Natural Science Foundation of Gansu Province under Grant Nos.21JR7RA570,20JR10RA334Basic Research Program of Gansu Province No.22JR11RA106,Gansu University of Political Science and Law Major Scientific Research and Innovation Projects under Grant No.GZF2020XZDA03+1 种基金the Young Doctoral Fund Project of Higher Education Institutions in Gansu Province in 2022 under Grant No.2022QB-123,Gansu Province Higher Education Innovation Fund Project under Grant No.2022A-097the University-Level Research Funding Project under Grant No.GZFXQNLW022 and University-Level Innovative Research Team of Gansu University of Political Science and Law.
文摘Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes and their neighbors,but ignore the dynamic dependencies between nodes.To address this challenge,we propose an innovative Adaptive Graph Convolutional Adjacency Matrix Network(TAMGCN),leveraging the attention mechanism to dynamically adjust dependencies between graph nodes.Specifically,we first segment shots and extract features of each frame,then compute the representative features of each shot.Subsequently,we utilize the attention mechanism to dynamically adjust the adjacency matrix of the graph convolutional network to better capture the dynamic dependencies between graph nodes.Finally,we fuse temporal features extracted by Bi-directional Long Short-Term Memory network with structural features extracted by the graph convolutional network to generate high-quality summaries.Extensive experiments are conducted on two benchmark datasets,TVSum and SumMe,yielding F1-scores of 60.8%and 53.2%,respectively.Experimental results demonstrate that our method outperforms most state-of-the-art video summarization techniques.
文摘Video watermarking plays a crucial role in protecting intellectual property rights and ensuring content authenticity.This study delves into the integration of Galois Field(GF)multiplication tables,especially GF(2^(4)),and their interaction with distinct irreducible polynomials.The primary aim is to enhance watermarking techniques for achieving imperceptibility,robustness,and efficient execution time.The research employs scene selection and adaptive thresholding techniques to streamline the watermarking process.Scene selection is used strategically to embed watermarks in the most vital frames of the video,while adaptive thresholding methods ensure that the watermarking process adheres to imperceptibility criteria,maintaining the video's visual quality.Concurrently,careful consideration is given to execution time,crucial in real-world scenarios,to balance efficiency and efficacy.The Peak Signal-to-Noise Ratio(PSNR)serves as a pivotal metric to gauge the watermark's imperceptibility and video quality.The study explores various irreducible polynomials,navigating the trade-offs between computational efficiency and watermark imperceptibility.In parallel,the study pays careful attention to the execution time,a paramount consideration in real-world scenarios,to strike a balance between efficiency and efficacy.This comprehensive analysis provides valuable insights into the interplay of GF multiplication tables,diverse irreducible polynomials,scene selection,adaptive thresholding,imperceptibility,and execution time.The evaluation of the proposed algorithm's robustness was conducted using PSNR and NC metrics,and it was subjected to assessment under the impact of five distinct attack scenarios.These findings contribute to the development of watermarking strategies that balance imperceptibility,robustness,and processing efficiency,enhancing the field's practicality and effectiveness.
基金supported by the National Natural Science Foundation of China,with Fund Numbers 62272478,62102451the National Defense Science and Technology Independent Research Project(Intelligent Information Hiding Technology and Its Applications in a Certain Field)and Science and Technology Innovation Team Innovative Research Project“Research on Key Technologies for Intelligent Information Hiding”with Fund Number ZZKY20222102.
文摘Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit methods exist for accurately embedding ownership or copyright information in video data,the nascent NeRV framework has yet to address this issue comprehensively.In response,this paper introduces MarkINeRV,a scheme designed to embed watermarking information into video frames using an invertible neural network watermarking approach to protect the copyright of NeRV,which models the embedding and extraction of watermarks as a pair of inverse processes of a reversible network and employs the same network to achieve embedding and extraction of watermarks.It is just that the information flow is in the opposite direction.Additionally,a video frame quality enhancement module is incorporated to mitigate watermarking information losses in the rendering process and the possibility ofmalicious attacks during transmission,ensuring the accurate extraction of watermarking information through the invertible network’s inverse process.This paper evaluates the accuracy,robustness,and invisibility of MarkINeRV through multiple video datasets.The results demonstrate its efficacy in extracting watermarking information for copyright protection of NeRV.MarkINeRV represents a pioneering investigation into copyright issues surrounding NeRV.