In the intricate network environment,the secure transmission of medical images faces challenges such as information leakage and malicious tampering,significantly impacting the accuracy of disease diagnoses by medical ...In the intricate network environment,the secure transmission of medical images faces challenges such as information leakage and malicious tampering,significantly impacting the accuracy of disease diagnoses by medical professionals.To address this problem,the authors propose a robust feature watermarking algorithm for encrypted medical images based on multi-stage discrete wavelet transform(DWT),Daisy descriptor,and discrete cosine transform(DCT).The algorithm initially encrypts the original medical image through DWT-DCT and Logistic mapping.Subsequently,a 3-stage DWT transformation is applied to the encrypted medical image,with the centre point of the LL3 sub-band within its low-frequency component serving as the sampling point.The Daisy descriptor matrix for this point is then computed.Finally,a DCT transformation is performed on the Daisy descriptor matrix,and the low-frequency portion is processed using the perceptual hashing algorithm to generate a 32-bit binary feature vector for the medical image.This scheme utilises cryptographic knowledge and zero-watermarking technique to embed watermarks without modifying medical images and can extract the watermark from test images without the original image,which meets the basic re-quirements of medical image watermarking.The embedding and extraction of water-marks are accomplished in a mere 0.160 and 0.411s,respectively,with minimal computational overhead.Simulation results demonstrate the robustness of the algorithm against both conventional attacks and geometric attacks,with a notable performance in resisting rotation attacks.展开更多
The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method in...The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method involves extracting structured data from video frames using facial landmark detection,which is then used as input to the CNN.The customized Convolutional Neural Network method is the date augmented-based CNN model to generate‘fake data’or‘fake images’.This study was carried out using Python and its libraries.We used 242 films from the dataset gathered by the Deep Fake Detection Challenge,of which 199 were made up and the remaining 53 were real.Ten seconds were allotted for each video.There were 318 videos used in all,199 of which were fake and 119 of which were real.Our proposedmethod achieved a testing accuracy of 91.47%,loss of 0.342,and AUC score of 0.92,outperforming two alternative approaches,CNN and MLP-CNN.Furthermore,our method succeeded in greater accuracy than contemporary models such as XceptionNet,Meso-4,EfficientNet-BO,MesoInception-4,VGG-16,and DST-Net.The novelty of this investigation is the development of a new Convolutional Neural Network(CNN)learning model that can accurately detect deep fake face photos.展开更多
The field of healthcare is considered to be the most promising application of intelligent sensor networks.However,the security and privacy protection ofmedical images collected by intelligent sensor networks is a hot ...The field of healthcare is considered to be the most promising application of intelligent sensor networks.However,the security and privacy protection ofmedical images collected by intelligent sensor networks is a hot problem that has attracted more and more attention.Fortunately,digital watermarking provides an effective method to solve this problem.In order to improve the robustness of the medical image watermarking scheme,in this paper,we propose a novel zero-watermarking algorithm with the integer wavelet transform(IWT),Schur decomposition and image block energy.Specifically,we first use IWT to extract low-frequency information and divide them into non-overlapping blocks,then we decompose the sub-blocks by Schur decomposition.After that,the feature matrix is constructed according to the relationship between the image block energy and the whole image energy.At the same time,we encrypt watermarking with the logistic chaotic position scrambling.Finally,the zero-watermarking is obtained by XOR operation with the encrypted watermarking.Three indexes of peak signal-to-noise ratio,normalization coefficient(NC)and the bit error rate(BER)are used to evaluate the robustness of the algorithm.According to the experimental results,most of the NC values are around 0.9 under various attacks,while the BER values are very close to 0.These experimental results show that the proposed algorithm is more robust than the existing zero-watermarking methods,which indicates it is more suitable for medical image privacy and security protection.展开更多
Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions i...Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.展开更多
Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate b...Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate based on facial video is an exciting research field for getting palpation information by observation diagnosis.However,most studies focus on optimizing the algorithm based on a small sample of participants without systematically investigating multiple influencing factors.A total of 209 participants and 2,435 facial videos,based on our self-constructed Multi-Scene Sign Dataset and the public datasets,were used to perform a multi-level and multi-factor comprehensive comparison.The effects of different datasets,blood volume pulse signal extraction algorithms,region of interests,time windows,color spaces,pulse rate calculation methods,and video recording scenes were analyzed.Furthermore,we proposed a blood volume pulse signal quality optimization strategy based on the inverse Fourier transform and an improvement strategy for pulse rate estimation based on signal-to-noise ratio threshold sliding.We found that the effects of video estimation of pulse rate in the Multi-Scene Sign Dataset and Pulse Rate Detection Dataset were better than in other datasets.Compared with Fast independent component analysis and Single Channel algorithms,chrominance-based method and plane-orthogonal-to-skin algorithms have a more vital anti-interference ability and higher robustness.The performances of the five-organs fusion area and the full-face area were better than that of single sub-regions,and the fewer motion artifacts and better lighting can improve the precision of pulse rate estimation.展开更多
Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It...Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It is also playing an essential role in devolving human-robot interaction.The dense video description is more difficult when compared with simple Video captioning because of the object’s interactions and event overlapping.Deep learning is changing the shape of computer vision(CV)technologies and natural language processing(NLP).There are hundreds of deep learning models,datasets,and evaluations that can improve the gaps in current research.This article filled this gap by evaluating some state-of-the-art approaches,especially focusing on deep learning and machine learning for video caption in a dense environment.In this article,some classic techniques concerning the existing machine learning were reviewed.And provides deep learning models,a detail of benchmark datasets with their respective domains.This paper reviews various evaluation metrics,including Bilingual EvaluationUnderstudy(BLEU),Metric for Evaluation of Translation with Explicit Ordering(METEOR),WordMover’s Distance(WMD),and Recall-Oriented Understudy for Gisting Evaluation(ROUGE)with their pros and cons.Finally,this article listed some future directions and proposed work for context enhancement using key scene extraction with object detection in a particular frame.Especially,how to improve the context of video description by analyzing key frames detection through morphological image analysis.Additionally,the paper discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection,which incorporates the fusion of large languagemodels for refining results.The ultimate results arise fromenhancing the generated text of the proposedmodel by improving the predicted text and isolating objects using various keyframes.These keyframes identify dense events occurring in the video sequence.展开更多
Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing s...Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing significant challenges.In response,various protection approaches have been developed to secure,authenticate,and ensure the integrity of digital videos.This study provides a comprehensive survey of the challenges associated with maintaining the confidentiality,integrity,and availability of video content,and examining how it can be manipulated.It then investigates current developments in the field of video security by exploring two critical research questions.First,it examine the techniques used by adversaries to compromise video data and evaluate their impact.Understanding these attack methodologies is crucial for developing effective defense mechanisms.Second,it explores the various security approaches that can be employed to protect video data,enhancing its transparency,integrity,and trustworthiness.It compares the effectiveness of these approaches across different use cases,including surveillance,video on demand(VoD),and medical videos related to disease diagnostics.Finally,it identifies potential research opportunities to enhance video data protection in response to the evolving threat landscape.Through this investigation,this study aims to contribute to the ongoing efforts in securing video data,providing insights that are vital for researchers,practitioners,and policymakers dedicated to enhancing the safety and reliability of video content in our digital world.展开更多
Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the...Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the technical and business issues surrounding cloudbased live streaming is provided,including the benefits of cloud computing,the various live streaming architectures,and the challenges that live streaming service providers face in delivering high‐quality,real‐time services.The different techniques used to improve the performance of video streaming,such as adaptive bit‐rate streaming,multicast distribution,and edge computing are discussed and the necessity of low‐latency and high‐quality video transmission in cloud‐based live streaming is underlined.Issues such as improving user experience and live streaming service performance using cutting‐edge technology,like artificial intelligence and machine learning are discussed.In addition,the legal and regulatory implications of cloud‐based live streaming,including issues with network neutrality,data privacy,and content moderation are addressed.The future of cloud computing for live streaming is examined in the section that follows,and it looks at the most likely new developments in terms of trends and technology.For technology vendors,live streaming service providers,and regulators,the findings have major policy‐relevant implications.Suggestions on how stakeholders should address these concerns and take advantage of the potential presented by this rapidly evolving sector,as well as insights into the key challenges and opportunities associated with cloud‐based live streaming are provided.展开更多
Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detectio...Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detection performance,this paper proposes a steganalysis method that can perfectly detectMV-based steganography in HEVC.Firstly,we define the local optimality of MVP(Motion Vector Prediction)based on the technology of AMVP(Advanced Motion Vector Prediction).Secondly,we analyze that in HEVC video,message embedding either usingMVP index orMVD(Motion Vector Difference)may destroy the above optimality of MVP.And then,we define the optimal rate of MVP as a steganalysis feature.Finally,we conduct steganalysis detection experiments on two general datasets for three popular steganographymethods and compare the performance with four state-ofthe-art steganalysis methods.The experimental results demonstrate the effectiveness of the proposed feature set.Furthermore,our method stands out for its practical applicability,requiring no model training and exhibiting low computational complexity,making it a viable solution for real-world scenarios.展开更多
With the advancement of video recording devices and network infrastructure,we use surveillance cameras to protect our valuable assets.This paper proposes a novel system for encrypting personal information within recor...With the advancement of video recording devices and network infrastructure,we use surveillance cameras to protect our valuable assets.This paper proposes a novel system for encrypting personal information within recorded surveillance videos to enhance efficiency and security.The proposed method leverages Dlib’s CNN-based facial recognition technology to identify Regions of Interest(ROIs)within the video,linking these ROIs to generate unique IDs.These IDs are then combined with a master key to create entity-specific keys,which are used to encrypt the ROIs within the video.This system supports selective decryption,effectively protecting personal information using surveillance footage.Additionally,the system overcomes the limitations of existing ROI recognition technologies by predicting unrecognized frames through post-processing.This research validates the proposed technology through experimental evaluations of execution time and post-processing techniques,ensuring comprehensive personal information protection.Guidelines for setting the thresholds used in this process are also provided.Implementing the proposed method could serve as an effective solution to security vulnerabilities that traditional approaches fail to address.展开更多
Objective: To study the problematic use of video games among secondary school students in the city of Parakou in 2023. Methods: Descriptive cross-sectional study conducted in the commune of Parakou from December 2022 ...Objective: To study the problematic use of video games among secondary school students in the city of Parakou in 2023. Methods: Descriptive cross-sectional study conducted in the commune of Parakou from December 2022 to July 2023. The study population consisted of students regularly enrolled in public and private secondary schools in the city of Parakou for the 2022-2023 academic year. A two-stage non-proportional stratified sampling technique combined with simple random sampling was adopted. The Problem Video Game Playing (PVP) scale was used to assess problem gambling in the study population, while anxiety and depression were assessed using the Hospital Anxiety and Depression Scale (HADS). Results: A total of 1030 students were included. The mean age of the pupils surveyed was 15.06 ± 2.68 years, with extremes of 10 and 28 years. The [13 - 18] age group was the most represented, with a proportion of 59.6% (614) in the general population. Females predominated, at 52.8% (544), with a sex ratio of 0.89. The prevalence of problematic video game use was 24.9%, measured using the Video Game Playing scale. Associated factors were male gender (p = 0.005), pocket money under 10,000 cfa (p = 0.001) and between 20,000 - 90,000 cfa (p = 0.030), addictive family behavior (p < 0.001), monogamous family (p = 0.023), good relationship with father (p = 0.020), organization of video game competitions (p = 0.001) and definite anxiety (p Conclusion: Substance-free addiction is struggling to attract the attention it deserves, as it did in its infancy everywhere else. This study complements existing data and serves as a reminder of the need to focus on this group of addictions, whose problematic use of video games remains the most frequent due to its accessibility and social tolerance. Preventive action combined with curative measures remains the most effective means of combating the problem at national level.展开更多
Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract ...Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract of the entire video that includes the most representative frames is known as static video summarization.This method resulted in rapid exploration,indexing,and retrieval of massive video libraries.We propose a framework for static video summary based on a Binary Robust Invariant Scalable Keypoint(BRISK)and bisecting K-means clustering algorithm.The current method effectively recognizes relevant frames using BRISK by extracting keypoints and the descriptors from video sequences.The video frames’BRISK features are clustered using a bisecting K-means,and the keyframe is determined by selecting the frame that is most near the cluster center.Without applying any clustering parameters,the appropriate clusters number is determined using the silhouette coefficient.Experiments were carried out on a publicly available open video project(OVP)dataset that contained videos of different genres.The proposed method’s effectiveness is compared to existing methods using a variety of evaluation metrics,and the proposed method achieves a trade-off between computational cost and quality.展开更多
In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is de...In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is dependent on a single video input source and few visual labels,and there is a problem with semantic alignment between video contents and generated natural sentences,which are not suitable for accurately comprehending and describing the video contents.To address this issue,this paper proposes a video captioning method by semantic topic-guided generation.First,a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during the encoding.Then,the semantic topics of video data are extracted using the visual labels retrieved from similar video data.In the decoding,a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network,which decreases the influence of“deviation”in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents.During this process,the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words.Finally,the experiments are conducted on two publicly used Microsoft Research Video Description andMicrosoft Research-Video to Text datasets.The experimental results demonstrate that the proposed method outperforms several state-of-art approaches.Specifically,the performance indicators Bilingual Evaluation Understudy,Metric for Evaluation of Translation with Explicit Ordering,Recall Oriented Understudy for Gisting Evaluation-longest common subsequence,and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%,0.1%,0.3%,and 2.4% on the Microsoft Research Video Description dataset,and 0.1%,1.0%,0.1%,and 2.8% on the Microsoft Research-Video to Text dataset,respectively,compared with the existing video captioning methods.As a result,the proposed method can generate video captioning that is more closely aligned with human natural language expression habits.展开更多
Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit metho...Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit methods exist for accurately embedding ownership or copyright information in video data,the nascent NeRV framework has yet to address this issue comprehensively.In response,this paper introduces MarkINeRV,a scheme designed to embed watermarking information into video frames using an invertible neural network watermarking approach to protect the copyright of NeRV,which models the embedding and extraction of watermarks as a pair of inverse processes of a reversible network and employs the same network to achieve embedding and extraction of watermarks.It is just that the information flow is in the opposite direction.Additionally,a video frame quality enhancement module is incorporated to mitigate watermarking information losses in the rendering process and the possibility ofmalicious attacks during transmission,ensuring the accurate extraction of watermarking information through the invertible network’s inverse process.This paper evaluates the accuracy,robustness,and invisibility of MarkINeRV through multiple video datasets.The results demonstrate its efficacy in extracting watermarking information for copyright protection of NeRV.MarkINeRV represents a pioneering investigation into copyright issues surrounding NeRV.展开更多
High-resolution video transmission requires a substantial amount of bandwidth.In this paper,we present a novel video processing methodology that innovatively integrates region of interest(ROI)identification and super-...High-resolution video transmission requires a substantial amount of bandwidth.In this paper,we present a novel video processing methodology that innovatively integrates region of interest(ROI)identification and super-resolution enhancement.Our method commences with the accurate detection of ROIs within video sequences,followed by the application of advanced super-resolution techniques to these areas,thereby preserving visual quality while economizing on data transmission.To validate and benchmark our approach,we have curated a new gaming dataset tailored to evaluate the effectiveness of ROI-based super-resolution in practical applications.The proposed model architecture leverages the transformer network framework,guided by a carefully designed multi-task loss function,which facilitates concurrent learning and execution of both ROI identification and resolution enhancement tasks.This unified deep learning model exhibits remarkable performance in achieving super-resolution on our custom dataset.The implications of this research extend to optimizing low-bitrate video streaming scenarios.By selectively enhancing the resolution of critical regions in videos,our solution enables high-quality video delivery under constrained bandwidth conditions.Empirical results demonstrate a 15%reduction in transmission bandwidth compared to traditional super-resolution based compression methods,without any perceivable decline in visual quality.This work thus contributes to the advancement of video compression and enhancement technologies,offering an effective strategy for improving digital media delivery efficiency and user experience,especially in bandwidth-limited environments.The innovative integration of ROI identification and super-resolution presents promising avenues for future research and development in adaptive and intelligent video communication systems.展开更多
Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes an...Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes and their neighbors,but ignore the dynamic dependencies between nodes.To address this challenge,we propose an innovative Adaptive Graph Convolutional Adjacency Matrix Network(TAMGCN),leveraging the attention mechanism to dynamically adjust dependencies between graph nodes.Specifically,we first segment shots and extract features of each frame,then compute the representative features of each shot.Subsequently,we utilize the attention mechanism to dynamically adjust the adjacency matrix of the graph convolutional network to better capture the dynamic dependencies between graph nodes.Finally,we fuse temporal features extracted by Bi-directional Long Short-Term Memory network with structural features extracted by the graph convolutional network to generate high-quality summaries.Extensive experiments are conducted on two benchmark datasets,TVSum and SumMe,yielding F1-scores of 60.8%and 53.2%,respectively.Experimental results demonstrate that our method outperforms most state-of-the-art video summarization techniques.展开更多
In this paper,we explore a distributed collaborative caching and computing model to support the distribution of adaptive bit rate video streaming.The aim is to reduce the average initial buffer delay and improve the q...In this paper,we explore a distributed collaborative caching and computing model to support the distribution of adaptive bit rate video streaming.The aim is to reduce the average initial buffer delay and improve the quality of user experience.Considering the difference between global and local video popularities and the time-varying characteristics of video popularity,a two-stage caching scheme is proposed to push popular videos closer to users and minimize the average initial buffer delay.Based on both long-term content popularity and short-term content popularity,the proposed caching solution is decouple into the proactive cache stage and the cache update stage.In the proactive cache stage,we develop a proactive cache placement algorithm that can be executed in an off-peak period.In the cache update stage,we propose a reactive cache update algorithm to update the existing cache policy to minimize the buffer delay.Simulation results verify that the proposed caching algorithms can reduce the initial buffer delay efficiently.展开更多
Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications indu...Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications industry loses millions of dollars due to poor video Quality of Experience(QoE)for users.Among the standard proposals for standardizing the quality of video streaming over internet service providers(ISPs)is the Mean Opinion Score(MOS).However,the accurate finding of QoE by MOS is subjective and laborious,and it varies depending on the user.A fully automated data analytics framework is required to reduce the inter-operator variability characteristic in QoE assessment.This work addresses this concern by suggesting a novel hybrid XGBStackQoE analytical model using a two-level layering technique.Level one combines multiple Machine Learning(ML)models via a layer one Hybrid XGBStackQoE-model.Individual ML models at level one are trained using the entire training data set.The level two Hybrid XGBStackQoE-Model is fitted using the outputs(meta-features)of the layer one ML models.The proposed model outperformed the conventional models,with an accuracy improvement of 4 to 5 percent,which is still higher than the current traditional models.The proposed framework could significantly improve video QoE accuracy.展开更多
Recently,there have been significant advancements in the study of semantic communication in single-modal scenarios.However,the ability to process information in multi-modal environments remains limited.Inspired by the...Recently,there have been significant advancements in the study of semantic communication in single-modal scenarios.However,the ability to process information in multi-modal environments remains limited.Inspired by the research and applications of natural language processing across different modalities,our goal is to accurately extract frame-level semantic information from videos and ultimately transmit high-quality videos.Specifically,we propose a deep learning-basedMulti-ModalMutual Enhancement Video Semantic Communication system,called M3E-VSC.Built upon a VectorQuantized Generative AdversarialNetwork(VQGAN),our systemaims to leverage mutual enhancement among different modalities by using text as the main carrier of transmission.With it,the semantic information can be extracted fromkey-frame images and audio of the video and performdifferential value to ensure that the extracted text conveys accurate semantic information with fewer bits,thus improving the capacity of the system.Furthermore,a multi-frame semantic detection module is designed to facilitate semantic transitions during video generation.Simulation results demonstrate that our proposed model maintains high robustness in complex noise environments,particularly in low signal-to-noise ratio conditions,significantly improving the accuracy and speed of semantic transmission in video communication by approximately 50 percent.展开更多
Video salient object detection(VSOD)aims at locating the most attractive objects in a video by exploring the spatial and temporal features.VSOD poses a challenging task in computer vision,as it involves processing com...Video salient object detection(VSOD)aims at locating the most attractive objects in a video by exploring the spatial and temporal features.VSOD poses a challenging task in computer vision,as it involves processing complex spatial data that is also influenced by temporal dynamics.Despite the progress made in existing VSOD models,they still struggle in scenes of great background diversity within and between frames.Additionally,they encounter difficulties related to accumulated noise and high time consumption during the extraction of temporal features over a long-term duration.We propose a multi-stream temporal enhanced network(MSTENet)to address these problems.It investigates saliency cues collaboration in the spatial domain with a multi-stream structure to deal with the great background diversity challenge.A straightforward,yet efficient approach for temporal feature extraction is developed to avoid the accumulative noises and reduce time consumption.The distinction between MSTENet and other VSOD methods stems from its incorporation of both foreground supervision and background supervision,facilitating enhanced extraction of collaborative saliency cues.Another notable differentiation is the innovative integration of spatial and temporal features,wherein the temporal module is integrated into the multi-stream structure,enabling comprehensive spatial-temporal interactions within an end-to-end framework.Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on five benchmark datasets while maintaining a real-time speed of 27 fps(Titan XP).Our code and models are available at https://github.com/RuJiaLe/MSTENet.展开更多
基金National Natural Science Foundation of China,Grant/Award Numbers:62063004,62350410483Key Research and Development Project of Hainan Province,Grant/Award Number:ZDYF2021SHFZ093Zhejiang Provincial Postdoctoral Science Foundation,Grant/Award Number:ZJ2021028。
文摘In the intricate network environment,the secure transmission of medical images faces challenges such as information leakage and malicious tampering,significantly impacting the accuracy of disease diagnoses by medical professionals.To address this problem,the authors propose a robust feature watermarking algorithm for encrypted medical images based on multi-stage discrete wavelet transform(DWT),Daisy descriptor,and discrete cosine transform(DCT).The algorithm initially encrypts the original medical image through DWT-DCT and Logistic mapping.Subsequently,a 3-stage DWT transformation is applied to the encrypted medical image,with the centre point of the LL3 sub-band within its low-frequency component serving as the sampling point.The Daisy descriptor matrix for this point is then computed.Finally,a DCT transformation is performed on the Daisy descriptor matrix,and the low-frequency portion is processed using the perceptual hashing algorithm to generate a 32-bit binary feature vector for the medical image.This scheme utilises cryptographic knowledge and zero-watermarking technique to embed watermarks without modifying medical images and can extract the watermark from test images without the original image,which meets the basic re-quirements of medical image watermarking.The embedding and extraction of water-marks are accomplished in a mere 0.160 and 0.411s,respectively,with minimal computational overhead.Simulation results demonstrate the robustness of the algorithm against both conventional attacks and geometric attacks,with a notable performance in resisting rotation attacks.
基金Science and Technology Funds from the Liaoning Education Department(Serial Number:LJKZ0104).
文摘The motivation for this study is that the quality of deep fakes is constantly improving,which leads to the need to develop new methods for their detection.The proposed Customized Convolutional Neural Network method involves extracting structured data from video frames using facial landmark detection,which is then used as input to the CNN.The customized Convolutional Neural Network method is the date augmented-based CNN model to generate‘fake data’or‘fake images’.This study was carried out using Python and its libraries.We used 242 films from the dataset gathered by the Deep Fake Detection Challenge,of which 199 were made up and the remaining 53 were real.Ten seconds were allotted for each video.There were 318 videos used in all,199 of which were fake and 119 of which were real.Our proposedmethod achieved a testing accuracy of 91.47%,loss of 0.342,and AUC score of 0.92,outperforming two alternative approaches,CNN and MLP-CNN.Furthermore,our method succeeded in greater accuracy than contemporary models such as XceptionNet,Meso-4,EfficientNet-BO,MesoInception-4,VGG-16,and DST-Net.The novelty of this investigation is the development of a new Convolutional Neural Network(CNN)learning model that can accurately detect deep fake face photos.
基金supported in part by the Hainan Provincial Natural Science Foundation of China (No.620MS067)the Intelligent Medical Project of Chongqing Medical University (ZHYXQNRC202101)the Student Scientific Research and Innovation Experiment Project of the Medical Information College of Chongqing Medical University (No.2020C006).
文摘The field of healthcare is considered to be the most promising application of intelligent sensor networks.However,the security and privacy protection ofmedical images collected by intelligent sensor networks is a hot problem that has attracted more and more attention.Fortunately,digital watermarking provides an effective method to solve this problem.In order to improve the robustness of the medical image watermarking scheme,in this paper,we propose a novel zero-watermarking algorithm with the integer wavelet transform(IWT),Schur decomposition and image block energy.Specifically,we first use IWT to extract low-frequency information and divide them into non-overlapping blocks,then we decompose the sub-blocks by Schur decomposition.After that,the feature matrix is constructed according to the relationship between the image block energy and the whole image energy.At the same time,we encrypt watermarking with the logistic chaotic position scrambling.Finally,the zero-watermarking is obtained by XOR operation with the encrypted watermarking.Three indexes of peak signal-to-noise ratio,normalization coefficient(NC)and the bit error rate(BER)are used to evaluate the robustness of the algorithm.According to the experimental results,most of the NC values are around 0.9 under various attacks,while the BER values are very close to 0.These experimental results show that the proposed algorithm is more robust than the existing zero-watermarking methods,which indicates it is more suitable for medical image privacy and security protection.
文摘Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
基金supported by the Key Research Program of the Chinese Academy of Sciences(grant number ZDRW-ZS-2021-1-2).
文摘Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis,and it is of great significance for determining the nature of cold and heat in diseases.The prediction of pulse rate based on facial video is an exciting research field for getting palpation information by observation diagnosis.However,most studies focus on optimizing the algorithm based on a small sample of participants without systematically investigating multiple influencing factors.A total of 209 participants and 2,435 facial videos,based on our self-constructed Multi-Scene Sign Dataset and the public datasets,were used to perform a multi-level and multi-factor comprehensive comparison.The effects of different datasets,blood volume pulse signal extraction algorithms,region of interests,time windows,color spaces,pulse rate calculation methods,and video recording scenes were analyzed.Furthermore,we proposed a blood volume pulse signal quality optimization strategy based on the inverse Fourier transform and an improvement strategy for pulse rate estimation based on signal-to-noise ratio threshold sliding.We found that the effects of video estimation of pulse rate in the Multi-Scene Sign Dataset and Pulse Rate Detection Dataset were better than in other datasets.Compared with Fast independent component analysis and Single Channel algorithms,chrominance-based method and plane-orthogonal-to-skin algorithms have a more vital anti-interference ability and higher robustness.The performances of the five-organs fusion area and the full-face area were better than that of single sub-regions,and the fewer motion artifacts and better lighting can improve the precision of pulse rate estimation.
文摘Video description generates natural language sentences that describe the subject,verb,and objects of the targeted Video.The video description has been used to help visually impaired people to understand the content.It is also playing an essential role in devolving human-robot interaction.The dense video description is more difficult when compared with simple Video captioning because of the object’s interactions and event overlapping.Deep learning is changing the shape of computer vision(CV)technologies and natural language processing(NLP).There are hundreds of deep learning models,datasets,and evaluations that can improve the gaps in current research.This article filled this gap by evaluating some state-of-the-art approaches,especially focusing on deep learning and machine learning for video caption in a dense environment.In this article,some classic techniques concerning the existing machine learning were reviewed.And provides deep learning models,a detail of benchmark datasets with their respective domains.This paper reviews various evaluation metrics,including Bilingual EvaluationUnderstudy(BLEU),Metric for Evaluation of Translation with Explicit Ordering(METEOR),WordMover’s Distance(WMD),and Recall-Oriented Understudy for Gisting Evaluation(ROUGE)with their pros and cons.Finally,this article listed some future directions and proposed work for context enhancement using key scene extraction with object detection in a particular frame.Especially,how to improve the context of video description by analyzing key frames detection through morphological image analysis.Additionally,the paper discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection,which incorporates the fusion of large languagemodels for refining results.The ultimate results arise fromenhancing the generated text of the proposedmodel by improving the predicted text and isolating objects using various keyframes.These keyframes identify dense events occurring in the video sequence.
基金funded by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Action(MSCA)grant agreement No.101109961.
文摘Videos represent the most prevailing form of digital media for communication,information dissemination,and monitoring.However,theirwidespread use has increased the risks of unauthorised access andmanipulation,posing significant challenges.In response,various protection approaches have been developed to secure,authenticate,and ensure the integrity of digital videos.This study provides a comprehensive survey of the challenges associated with maintaining the confidentiality,integrity,and availability of video content,and examining how it can be manipulated.It then investigates current developments in the field of video security by exploring two critical research questions.First,it examine the techniques used by adversaries to compromise video data and evaluate their impact.Understanding these attack methodologies is crucial for developing effective defense mechanisms.Second,it explores the various security approaches that can be employed to protect video data,enhancing its transparency,integrity,and trustworthiness.It compares the effectiveness of these approaches across different use cases,including surveillance,video on demand(VoD),and medical videos related to disease diagnostics.Finally,it identifies potential research opportunities to enhance video data protection in response to the evolving threat landscape.Through this investigation,this study aims to contribute to the ongoing efforts in securing video data,providing insights that are vital for researchers,practitioners,and policymakers dedicated to enhancing the safety and reliability of video content in our digital world.
文摘Cloud computing has drastically changed the delivery and consumption of live streaming content.The designs,challenges,and possible uses of cloud computing for live streaming are studied.A comprehensive overview of the technical and business issues surrounding cloudbased live streaming is provided,including the benefits of cloud computing,the various live streaming architectures,and the challenges that live streaming service providers face in delivering high‐quality,real‐time services.The different techniques used to improve the performance of video streaming,such as adaptive bit‐rate streaming,multicast distribution,and edge computing are discussed and the necessity of low‐latency and high‐quality video transmission in cloud‐based live streaming is underlined.Issues such as improving user experience and live streaming service performance using cutting‐edge technology,like artificial intelligence and machine learning are discussed.In addition,the legal and regulatory implications of cloud‐based live streaming,including issues with network neutrality,data privacy,and content moderation are addressed.The future of cloud computing for live streaming is examined in the section that follows,and it looks at the most likely new developments in terms of trends and technology.For technology vendors,live streaming service providers,and regulators,the findings have major policy‐relevant implications.Suggestions on how stakeholders should address these concerns and take advantage of the potential presented by this rapidly evolving sector,as well as insights into the key challenges and opportunities associated with cloud‐based live streaming are provided.
基金the National Natural Science Foundation of China(Grant Nos.62272478,62202496,61872384).
文摘Among steganalysis techniques,detection against MV(motion vector)domain-based video steganography in the HEVC(High Efficiency Video Coding)standard remains a challenging issue.For the purpose of improving the detection performance,this paper proposes a steganalysis method that can perfectly detectMV-based steganography in HEVC.Firstly,we define the local optimality of MVP(Motion Vector Prediction)based on the technology of AMVP(Advanced Motion Vector Prediction).Secondly,we analyze that in HEVC video,message embedding either usingMVP index orMVD(Motion Vector Difference)may destroy the above optimality of MVP.And then,we define the optimal rate of MVP as a steganalysis feature.Finally,we conduct steganalysis detection experiments on two general datasets for three popular steganographymethods and compare the performance with four state-ofthe-art steganalysis methods.The experimental results demonstrate the effectiveness of the proposed feature set.Furthermore,our method stands out for its practical applicability,requiring no model training and exhibiting low computational complexity,making it a viable solution for real-world scenarios.
基金supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP)funded by the Korea Government (MIST),Development of Collection and Integrated Analysis Methods of Automotive Inter and Intra System Artifacts through Construction of Event-Based Experimental System,under RS-2022-II221022.
文摘With the advancement of video recording devices and network infrastructure,we use surveillance cameras to protect our valuable assets.This paper proposes a novel system for encrypting personal information within recorded surveillance videos to enhance efficiency and security.The proposed method leverages Dlib’s CNN-based facial recognition technology to identify Regions of Interest(ROIs)within the video,linking these ROIs to generate unique IDs.These IDs are then combined with a master key to create entity-specific keys,which are used to encrypt the ROIs within the video.This system supports selective decryption,effectively protecting personal information using surveillance footage.Additionally,the system overcomes the limitations of existing ROI recognition technologies by predicting unrecognized frames through post-processing.This research validates the proposed technology through experimental evaluations of execution time and post-processing techniques,ensuring comprehensive personal information protection.Guidelines for setting the thresholds used in this process are also provided.Implementing the proposed method could serve as an effective solution to security vulnerabilities that traditional approaches fail to address.
文摘Objective: To study the problematic use of video games among secondary school students in the city of Parakou in 2023. Methods: Descriptive cross-sectional study conducted in the commune of Parakou from December 2022 to July 2023. The study population consisted of students regularly enrolled in public and private secondary schools in the city of Parakou for the 2022-2023 academic year. A two-stage non-proportional stratified sampling technique combined with simple random sampling was adopted. The Problem Video Game Playing (PVP) scale was used to assess problem gambling in the study population, while anxiety and depression were assessed using the Hospital Anxiety and Depression Scale (HADS). Results: A total of 1030 students were included. The mean age of the pupils surveyed was 15.06 ± 2.68 years, with extremes of 10 and 28 years. The [13 - 18] age group was the most represented, with a proportion of 59.6% (614) in the general population. Females predominated, at 52.8% (544), with a sex ratio of 0.89. The prevalence of problematic video game use was 24.9%, measured using the Video Game Playing scale. Associated factors were male gender (p = 0.005), pocket money under 10,000 cfa (p = 0.001) and between 20,000 - 90,000 cfa (p = 0.030), addictive family behavior (p < 0.001), monogamous family (p = 0.023), good relationship with father (p = 0.020), organization of video game competitions (p = 0.001) and definite anxiety (p Conclusion: Substance-free addiction is struggling to attract the attention it deserves, as it did in its infancy everywhere else. This study complements existing data and serves as a reminder of the need to focus on this group of addictions, whose problematic use of video games remains the most frequent due to its accessibility and social tolerance. Preventive action combined with curative measures remains the most effective means of combating the problem at national level.
基金The authors would like to thank Research Supporting Project Number(RSP2024R444)King Saud University,Riyadh,Saudi Arabia.
文摘Due to the exponential growth of video data,aided by rapid advancements in multimedia technologies.It became difficult for the user to obtain information from a large video series.The process of providing an abstract of the entire video that includes the most representative frames is known as static video summarization.This method resulted in rapid exploration,indexing,and retrieval of massive video libraries.We propose a framework for static video summary based on a Binary Robust Invariant Scalable Keypoint(BRISK)and bisecting K-means clustering algorithm.The current method effectively recognizes relevant frames using BRISK by extracting keypoints and the descriptors from video sequences.The video frames’BRISK features are clustered using a bisecting K-means,and the keyframe is determined by selecting the frame that is most near the cluster center.Without applying any clustering parameters,the appropriate clusters number is determined using the silhouette coefficient.Experiments were carried out on a publicly available open video project(OVP)dataset that contained videos of different genres.The proposed method’s effectiveness is compared to existing methods using a variety of evaluation metrics,and the proposed method achieves a trade-off between computational cost and quality.
基金supported in part by the National Natural Science Foundation of China under Grant 61873277in part by the Natural Science Basic Research Plan in Shaanxi Province of China underGrant 2020JQ-758in part by the Chinese Postdoctoral Science Foundation under Grant 2020M673446.
文摘In the video captioning methods based on an encoder-decoder,limited visual features are extracted by an encoder,and a natural sentence of the video content is generated using a decoder.However,this kind ofmethod is dependent on a single video input source and few visual labels,and there is a problem with semantic alignment between video contents and generated natural sentences,which are not suitable for accurately comprehending and describing the video contents.To address this issue,this paper proposes a video captioning method by semantic topic-guided generation.First,a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during the encoding.Then,the semantic topics of video data are extracted using the visual labels retrieved from similar video data.In the decoding,a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network,which decreases the influence of“deviation”in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents.During this process,the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words.Finally,the experiments are conducted on two publicly used Microsoft Research Video Description andMicrosoft Research-Video to Text datasets.The experimental results demonstrate that the proposed method outperforms several state-of-art approaches.Specifically,the performance indicators Bilingual Evaluation Understudy,Metric for Evaluation of Translation with Explicit Ordering,Recall Oriented Understudy for Gisting Evaluation-longest common subsequence,and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%,0.1%,0.3%,and 2.4% on the Microsoft Research Video Description dataset,and 0.1%,1.0%,0.1%,and 2.8% on the Microsoft Research-Video to Text dataset,respectively,compared with the existing video captioning methods.As a result,the proposed method can generate video captioning that is more closely aligned with human natural language expression habits.
基金supported by the National Natural Science Foundation of China,with Fund Numbers 62272478,62102451the National Defense Science and Technology Independent Research Project(Intelligent Information Hiding Technology and Its Applications in a Certain Field)and Science and Technology Innovation Team Innovative Research Project“Research on Key Technologies for Intelligent Information Hiding”with Fund Number ZZKY20222102.
文摘Recent research advances in implicit neural representation have shown that a wide range of video data distributions are achieved by sharing model weights for Neural Representation for Videos(NeRV).While explicit methods exist for accurately embedding ownership or copyright information in video data,the nascent NeRV framework has yet to address this issue comprehensively.In response,this paper introduces MarkINeRV,a scheme designed to embed watermarking information into video frames using an invertible neural network watermarking approach to protect the copyright of NeRV,which models the embedding and extraction of watermarks as a pair of inverse processes of a reversible network and employs the same network to achieve embedding and extraction of watermarks.It is just that the information flow is in the opposite direction.Additionally,a video frame quality enhancement module is incorporated to mitigate watermarking information losses in the rendering process and the possibility ofmalicious attacks during transmission,ensuring the accurate extraction of watermarking information through the invertible network’s inverse process.This paper evaluates the accuracy,robustness,and invisibility of MarkINeRV through multiple video datasets.The results demonstrate its efficacy in extracting watermarking information for copyright protection of NeRV.MarkINeRV represents a pioneering investigation into copyright issues surrounding NeRV.
基金funded by National Key Research and Development Program of China(No.2022YFC3302103).
文摘High-resolution video transmission requires a substantial amount of bandwidth.In this paper,we present a novel video processing methodology that innovatively integrates region of interest(ROI)identification and super-resolution enhancement.Our method commences with the accurate detection of ROIs within video sequences,followed by the application of advanced super-resolution techniques to these areas,thereby preserving visual quality while economizing on data transmission.To validate and benchmark our approach,we have curated a new gaming dataset tailored to evaluate the effectiveness of ROI-based super-resolution in practical applications.The proposed model architecture leverages the transformer network framework,guided by a carefully designed multi-task loss function,which facilitates concurrent learning and execution of both ROI identification and resolution enhancement tasks.This unified deep learning model exhibits remarkable performance in achieving super-resolution on our custom dataset.The implications of this research extend to optimizing low-bitrate video streaming scenarios.By selectively enhancing the resolution of critical regions in videos,our solution enables high-quality video delivery under constrained bandwidth conditions.Empirical results demonstrate a 15%reduction in transmission bandwidth compared to traditional super-resolution based compression methods,without any perceivable decline in visual quality.This work thus contributes to the advancement of video compression and enhancement technologies,offering an effective strategy for improving digital media delivery efficiency and user experience,especially in bandwidth-limited environments.The innovative integration of ROI identification and super-resolution presents promising avenues for future research and development in adaptive and intelligent video communication systems.
基金This work was supported by Natural Science Foundation of Gansu Province under Grant Nos.21JR7RA570,20JR10RA334Basic Research Program of Gansu Province No.22JR11RA106,Gansu University of Political Science and Law Major Scientific Research and Innovation Projects under Grant No.GZF2020XZDA03+1 种基金the Young Doctoral Fund Project of Higher Education Institutions in Gansu Province in 2022 under Grant No.2022QB-123,Gansu Province Higher Education Innovation Fund Project under Grant No.2022A-097the University-Level Research Funding Project under Grant No.GZFXQNLW022 and University-Level Innovative Research Team of Gansu University of Political Science and Law.
文摘Video summarization aims to select key frames or key shots to create summaries for fast retrieval,compression,and efficient browsing of videos.Graph neural networks efficiently capture information about graph nodes and their neighbors,but ignore the dynamic dependencies between nodes.To address this challenge,we propose an innovative Adaptive Graph Convolutional Adjacency Matrix Network(TAMGCN),leveraging the attention mechanism to dynamically adjust dependencies between graph nodes.Specifically,we first segment shots and extract features of each frame,then compute the representative features of each shot.Subsequently,we utilize the attention mechanism to dynamically adjust the adjacency matrix of the graph convolutional network to better capture the dynamic dependencies between graph nodes.Finally,we fuse temporal features extracted by Bi-directional Long Short-Term Memory network with structural features extracted by the graph convolutional network to generate high-quality summaries.Extensive experiments are conducted on two benchmark datasets,TVSum and SumMe,yielding F1-scores of 60.8%and 53.2%,respectively.Experimental results demonstrate that our method outperforms most state-of-the-art video summarization techniques.
基金the National Natural Science Foundation of China under grants 61901078,61871062,and U20A20157in part by the China University Industry-University-Research Collaborative Innovation Fund(Future Network Innovation Research and Application Project)under grant 2021FNA04008+5 种基金in part by the China Postdoctoral Science Foundation under grant 2022MD713692in part by the Chongqing Postdoctoral Science Special Foundation under grant 2021XM2018in part by the Natural Science Foundation of Chongqing under grant cstc2020jcyj-zdxmX0024in part by University Innovation Research Group of Chongqing under grant CXQT20017in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202000626in part by the Youth Innovation Group Support Program of ICE Discipline of CQUPT under grant SCIE-QN-2022-04.
文摘In this paper,we explore a distributed collaborative caching and computing model to support the distribution of adaptive bit rate video streaming.The aim is to reduce the average initial buffer delay and improve the quality of user experience.Considering the difference between global and local video popularities and the time-varying characteristics of video popularity,a two-stage caching scheme is proposed to push popular videos closer to users and minimize the average initial buffer delay.Based on both long-term content popularity and short-term content popularity,the proposed caching solution is decouple into the proactive cache stage and the cache update stage.In the proactive cache stage,we develop a proactive cache placement algorithm that can be executed in an off-peak period.In the cache update stage,we propose a reactive cache update algorithm to update the existing cache policy to minimize the buffer delay.Simulation results verify that the proposed caching algorithms can reduce the initial buffer delay efficiently.
文摘Video streaming applications have grown considerably in recent years.As a result,this becomes one of the most significant contributors to global internet traffic.According to recent studies,the telecommunications industry loses millions of dollars due to poor video Quality of Experience(QoE)for users.Among the standard proposals for standardizing the quality of video streaming over internet service providers(ISPs)is the Mean Opinion Score(MOS).However,the accurate finding of QoE by MOS is subjective and laborious,and it varies depending on the user.A fully automated data analytics framework is required to reduce the inter-operator variability characteristic in QoE assessment.This work addresses this concern by suggesting a novel hybrid XGBStackQoE analytical model using a two-level layering technique.Level one combines multiple Machine Learning(ML)models via a layer one Hybrid XGBStackQoE-model.Individual ML models at level one are trained using the entire training data set.The level two Hybrid XGBStackQoE-Model is fitted using the outputs(meta-features)of the layer one ML models.The proposed model outperformed the conventional models,with an accuracy improvement of 4 to 5 percent,which is still higher than the current traditional models.The proposed framework could significantly improve video QoE accuracy.
基金supported by the National Key Research and Development Project under Grant 2020YFB1807602Key Program of Marine Economy Development Special Foundation of Department of Natural Resources of Guangdong Province(GDNRC[2023]24)the National Natural Science Foundation of China under Grant 62271267.
文摘Recently,there have been significant advancements in the study of semantic communication in single-modal scenarios.However,the ability to process information in multi-modal environments remains limited.Inspired by the research and applications of natural language processing across different modalities,our goal is to accurately extract frame-level semantic information from videos and ultimately transmit high-quality videos.Specifically,we propose a deep learning-basedMulti-ModalMutual Enhancement Video Semantic Communication system,called M3E-VSC.Built upon a VectorQuantized Generative AdversarialNetwork(VQGAN),our systemaims to leverage mutual enhancement among different modalities by using text as the main carrier of transmission.With it,the semantic information can be extracted fromkey-frame images and audio of the video and performdifferential value to ensure that the extracted text conveys accurate semantic information with fewer bits,thus improving the capacity of the system.Furthermore,a multi-frame semantic detection module is designed to facilitate semantic transitions during video generation.Simulation results demonstrate that our proposed model maintains high robustness in complex noise environments,particularly in low signal-to-noise ratio conditions,significantly improving the accuracy and speed of semantic transmission in video communication by approximately 50 percent.
基金funded by the Natural Science Foundation China(NSFC)under Grant No.62203192.
文摘Video salient object detection(VSOD)aims at locating the most attractive objects in a video by exploring the spatial and temporal features.VSOD poses a challenging task in computer vision,as it involves processing complex spatial data that is also influenced by temporal dynamics.Despite the progress made in existing VSOD models,they still struggle in scenes of great background diversity within and between frames.Additionally,they encounter difficulties related to accumulated noise and high time consumption during the extraction of temporal features over a long-term duration.We propose a multi-stream temporal enhanced network(MSTENet)to address these problems.It investigates saliency cues collaboration in the spatial domain with a multi-stream structure to deal with the great background diversity challenge.A straightforward,yet efficient approach for temporal feature extraction is developed to avoid the accumulative noises and reduce time consumption.The distinction between MSTENet and other VSOD methods stems from its incorporation of both foreground supervision and background supervision,facilitating enhanced extraction of collaborative saliency cues.Another notable differentiation is the innovative integration of spatial and temporal features,wherein the temporal module is integrated into the multi-stream structure,enabling comprehensive spatial-temporal interactions within an end-to-end framework.Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on five benchmark datasets while maintaining a real-time speed of 27 fps(Titan XP).Our code and models are available at https://github.com/RuJiaLe/MSTENet.