Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers the likelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in video streams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enable instant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing action datasets often lack diversity and specificity for workout actions, hindering the development of accurate recognition models. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significant contribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated to encompass various exercises performed by numerous individuals in different settings. This research proposes an innovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDC-GRU) network for workout action recognition in video streams. Unlike image-based action recognition, videos contain spatio-temporal information, making the task more complex and challenging. While substantial progress has been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions, and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attention model demonstrated exceptional classification performance with 95.81% accuracy in classifying workout action videos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and 93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101, respectively, showcasing its superiority and robustness in action recognition. The findings suggest practical implications in real-world scenarios where precise video action recognition is paramount, addressing the persisting challenges in the field. The WAVd dataset serves as a catalyst for the development of more robust and effective fitness tracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
Electric power training is essential for ensuring the safety and reliability of the system. In this study, we introduce a novel Abnormal Action Recognition (AAR) system that utilizes a Lightweight Pose Estimation Network (LPEN) to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios. The LPEN network, comprising three stages (MobileNet, Initial Stage, and Refinement Stage), is employed to swiftly extract image features, detect human key points, and refine them for accurate analysis. Subsequently, a Pose-aware Action Analysis Module (PAAM) captures the positional coordinates of human skeletal points in each frame. Finally, an Abnormal Action Inference Module (AAIM) evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring. For fall-down recognition, three criteria are considered: falling speed, main angles of skeletal points, and the person’s bounding box. To identify unauthorized trespass, emphasis is placed on the position of the ankles. Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.
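The abstract above says only that trespass detection relies on the position of the ankles. As a minimal sketch of one way such a check could work, the snippet below tests whether either ankle keypoint lies inside a restricted-zone polygon. The zone polygon, the keypoint names `left_ankle` and `right_ankle`, and both helper functions are hypothetical illustrations, not the authors' AAIM.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def is_trespassing(keypoints, zone):
    """Assumed rule: flag a trespass if any detected ankle is inside the zone."""
    ankles = [keypoints[name] for name in ("left_ankle", "right_ankle")
              if name in keypoints]
    return any(point_in_polygon(a, zone) for a in ankles)


# Hypothetical restricted zone in image coordinates (a 10 x 10 square).
zone = [(0, 0), (10, 0), (10, 10), (0, 10)]
inside_pose = {"left_ankle": (5, 5), "right_ankle": (12, 5)}
outside_pose = {"left_ankle": (11, 5), "right_ankle": (12, 5)}

print(is_trespassing(inside_pose, zone))
print(is_trespassing(outside_pose, zone))
```

Using the ankles rather than the whole bounding box means a person leaning over a boundary is not flagged until a foot actually crosses it, which is one plausible reason for the emphasis described in the abstract.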
In recent years, skeleton-based action recognition has made great achievements in computer vision. A graph convolutional network (GCN) is effective for action recognition, modelling the human skeleton as a spatio-temporal graph. Most GCNs define the graph topology by physical relations of the human joints. However, this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs, resulting in a low recognition rate for specific actions with implicit correlation between joint pairs. In addition, existing methods ignore the trend correlation between adjacent frames within an action and context clues, leading to erroneous action recognition with similar poses. Therefore, this study proposes a learnable GCN based on behavior dependence, which considers implicit joint correlation by constructing a dynamic learnable graph with extraction of specific behavior dependence of joint pairs. By using the weight relationship between the joint pairs, an adaptive model is constructed. It also designs a self-attention module to obtain their inter-frame topological relationship for exploring the context of actions. Combining the shared topology and the multi-head self-attention map, the module obtains the context-based clue topology to update the dynamic graph convolution, achieving accurate recognition of different actions with similar poses. Detailed experiments on public datasets demonstrate that the proposed method achieves better results and realizes higher quality representation of actions under various evaluation protocols compared to state-of-the-art methods.
Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual and skeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data, failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility to efficiently process both uniform and disparate input patterns. Thus, in this paper, an attention-enhanced pseudo-3D residual model, called HgaNets, is proposed to address the GAR problem. This model comprises two independent components designed for modeling visual RGB (red, green and blue) images and 3D skeletal heatmaps, respectively. More specifically, each component consists of two main parts: 1) a multi-dimensional attention module for capturing important spatial, temporal and feature information in human gestures; 2) a spatiotemporal convolution module that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then, the output weights of the two components are fused to generate the recognition results. Finally, we conducted experiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy on the four datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, with an inference time of 0.54 s and 2.74M parameters. These findings highlight that the proposed model outperforms other existing approaches in terms of recognition accuracy.
In recent years, wearable device-based Human Activity Recognition (HAR) models have received significant attention. Previously developed HAR models use hand-crafted features to recognize human activities, leading to the extraction of only basic features. The images captured by wearable sensors contain advanced features, allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions. Poor lighting and limited sensor capabilities can impact data quality, making the recognition of human actions a challenging task. Unimodal HAR approaches are not suitable in a real-time environment. Therefore, an updated HAR model is developed using multiple types of data and an advanced deep-learning approach. Firstly, the required signals and sensor data are accumulated from the standard databases. From these signals, the wave features are retrieved. Then the extracted wave features and sensor data are given as the input to recognize the human activity. An Adaptive Hybrid Deep Attentive Network (AHDAN) is developed by incorporating a “1D Convolutional Neural Network (1DCNN)” with a “Gated Recurrent Unit (GRU)” for the human activity recognition process. Additionally, the Enhanced Archerfish Hunting Optimizer (EAHO) is suggested to fine-tune the network parameters for enhancing the recognition process. An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model. The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36, a recall of 95.25, a specificity of 95.48, and a precision of 95.47. The results show that the developed model recognizes human actions effectively while taking less time. Additionally, it reduces the computational complexity and the overfitting issue by using an optimization approach.
Human Action Recognition (HAR) in uncontrolled environments aims to recognize different actions from a video. An effective HAR model can be employed for applications such as human-computer interaction, health care, person tracking, and video surveillance. Machine Learning (ML) approaches, specifically Convolutional Neural Network (CNN) models, have been widely used and have achieved impressive results through feature fusion. The accuracy and effectiveness of these models continue to be the biggest challenge in this field. In this article, a novel feature optimization algorithm, called improved Shark Smell Optimization (iSSO), is proposed to reduce the redundancy of extracted features. The proposed technique is inspired by the behavior of white sharks and how they find the best prey in the whole search space. The proposed iSSO algorithm divides the Feature Vector (FV) into subparts, where a search is conducted to find optimal local features from each subpart of the FV. Once local optimal features are selected, a global search is conducted to further optimize these features. The proposed iSSO algorithm is employed on nine (9) selected CNN models. These CNN models are selected based on their top-1 and top-5 accuracy in the ImageNet competition. To evaluate the model, two publicly available datasets, UCF-Sports and Hollywood2, are selected.
The combination of spatiotemporal videos and essential features can improve the performance of human action recognition (HAR); however, the individual type of features usually degrades the performance due to similar actions and complex backgrounds. The deep convolutional neural network has improved performance in recent years for several computer vision applications due to its spatial information. This article proposes a new framework for video surveillance human action recognition, dubbed HybridHR-Net. On a few selected datasets, deep transfer learning is used to train the pre-trained EfficientNet-b0 deep learning model. Bayesian optimization is employed for tuning the hyperparameters of the fine-tuned deep model. Instead of fully connected layer features, we considered the average pooling layer features and performed two feature selection techniques: an improved artificial bee colony and an entropy-based approach. Using a serial technique, the selected features are combined into a single vector, and then the results are categorized by machine learning classifiers. Five publicly accessible datasets have been utilized for the experiments, obtaining notable accuracies of 97%, 98.7%, 100%, 99.7%, and 96.8%, respectively. Additionally, the proposed framework is compared with contemporary methods to demonstrate the increase in accuracy.
The ever-growing available visual data (i.e., uploaded videos and pictures by internet users) has attracted the research community’s attention in the computer vision field. Therefore, finding efficient solutions to extract knowledge from these sources is imperative. Recently, the BlazePose system has been released for skeleton extraction from images oriented to mobile devices. With this skeleton graph representation in place, a Spatial-Temporal Graph Convolutional Network can be implemented to predict the action. We hypothesize that just by changing the skeleton input data for a different set of joints that offers more information about the action of interest, it is possible to increase the performance of the Spatial-Temporal Graph Convolutional Network for HAR tasks. Hence, in this study, we present the first implementation of the BlazePose skeleton topology upon this architecture for action recognition. Moreover, we propose the Enhanced-BlazePose topology that can achieve better results than its predecessor. Additionally, we propose different skeleton detection thresholds that can improve the accuracy performance even further. We reached a top-1 accuracy performance of 40.1% on the Kinetics dataset. For the NTU-RGB+D dataset, we achieved 87.59% and 92.1% accuracy for Cross-Subject and Cross-View evaluation criteria, respectively.
Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recognition. Previous action recognition methods utilize a fixed spatiotemporal window to learn local video representation. However, these methods fail to capture complex motion patterns due to their limited receptive field. To solve the above problems, this paper proposes a lightweight Temporal Pyramid Excitation (TPE) module to capture the short, medium, and long-term temporal context. In this method, the Temporal Pyramid (TP) module can effectively expand the temporal receptive field of the network by using multi-temporal kernel decomposition without significantly increasing the computational cost. In addition, the Multi Excitation module can emphasize temporal importance to enhance the temporal feature representation learning. TPE can be integrated into ResNet50 to build a compact video learning framework, TPENet. Extensive validation experiments on several challenging benchmark datasets (Something-Something V1, Something-Something V2, UCF-101, and HMDB51) demonstrate that our method achieves a preferable balance between computation and accuracy.
Action recognition and detection is an important research topic in computer vision, which can be divided into action recognition and action detection. At present, the distinction between action recognition and action detection is not clear, and the relevant reviews are not comprehensive. Thus, this paper summarizes the action recognition and detection methods and datasets based on deep learning to accurately present the research status in this field. Firstly, according to the way that temporal and spatial features are extracted from the model, the commonly used models of action recognition are divided into two-stream models, temporal models, spatiotemporal models, and transformer models. This paper briefly analyzes the characteristics of the four models and introduces the accuracy of various algorithms on common datasets. Then, from the perspective of the tasks to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and commonly used datasets are introduced. From the perspectives of the two-stage method and the one-stage method, various algorithms of temporal action detection are reviewed, and the various algorithms of spatiotemporal action detection are summarized in detail. Finally, the relationship between different parts of action recognition and detection is discussed, the difficulties faced by the current research are summarized in detail, and future development is discussed.
Human action recognition (HAR) based on artificial intelligence reasoning is the most important research area in computer vision. Big breakthroughs in this field have been observed in the last few years; additionally, research interest in this field is evolving toward topics such as understanding of actions and scenes, studying human joints, and human posture recognition. Many HAR techniques have been introduced in the literature. Nonetheless, the challenge of redundant and irrelevant features reduces recognition accuracy. These techniques also face other challenges, such as differing perspectives, environmental conditions, and temporal variations, among others. In this work, a framework based on deep learning and an improved whale optimization algorithm is proposed for HAR. The proposed framework consists of a few core stages, i.e., initial frame preprocessing, fine-tuning of pre-trained deep learning models through transfer learning (TL), feature fusion using a modified serial-based approach, and improved whale optimization-based selection of the best features for final classification. Two pre-trained deep learning models, InceptionV3 and Resnet101, are fine-tuned, and TL is employed to train them on action recognition datasets. The fusion process increases the length of the feature vectors; therefore, an improved whale optimization algorithm is proposed to select the best features. The best selected features are finally classified using machine learning (ML) classifiers. Four publicly accessible datasets, namely UT-Interaction, Hollywood, Free Viewpoint Action Recognition using Motion History Volumes (IXMAS), and centre of computer vision (UCF) Sports, are employed, achieving testing accuracies of 100%, 99.9%, 99.1%, and 100%, respectively. Compared with state-of-the-art (SOTA) techniques, the proposed method showed improved accuracy.
Human action recognition (HAR) attempts to understand a subject’s behavior and assign a label to each action performed. It is more appealing because it has a wide range of applications in computer vision, such as video surveillance and smart cities. Many attempts have been made in the literature to develop an effective and robust framework for HAR. Still, the process remains difficult and may result in reduced accuracy due to several challenges, such as similarity among actions, extraction of essential features, and reduction of irrelevant features. In this work, we propose an end-to-end framework using deep learning and an improved tree seed optimization algorithm for accurate HAR. The proposed design consists of a few significant steps. In the first step, frame preprocessing is performed. In the second step, two pre-trained deep learning models are fine-tuned and trained through deep transfer learning using preprocessed video frames. In the next step, deep learning features of both fine-tuned models are fused using a new Parallel Standard Deviation Padding Max Value approach. The fused features are further optimized using an improved tree seed algorithm, and the best selected features are finally classified using machine learning classifiers. The experiment was carried out on five publicly available datasets, including UT-Interaction, Weizmann, KTH, Hollywood, and IXMAS, and achieved higher accuracy than previous techniques.
The BlazePose, which models human body skeletons as spatiotemporal graphs, has achieved fantastic performance in skeleton-based action identification. Skeleton extraction from photos for mobile devices has been made possible by the BlazePose system. A Spatial-Temporal Graph Convolutional Network (STGCN) can then forecast the actions. The STGCN can be improved by simply replacing the skeleton input data with a different set of joints that provide more information about the activity of interest. On the other hand, existing approaches require the user to manually set the graph’s topology and then fix it across all input layers and samples. This research shows how to use the Statistical Fractal Search (SFS)-Guided whale optimization algorithm (GWOA). To get the best solution for the GWOA, we adopt the SFS diffusion algorithm, which uses the random walk with a Gaussian distribution method common to growing systems. Continuous values are transformed into binary to apply to the feature-selection problem in conjunction with the BlazePose skeletal topology and stochastic fractal search to construct a novel implementation of the BlazePose topology for action recognition. In our experiments, we employed the Kinetics and the NTU-RGB+D datasets. The achieved action accuracy in the X-View is 93.14% and in the X-Sub is 96.74%. In addition, the proposed model performs better in numerous statistical tests such as the Analysis of Variance (ANOVA), Wilcoxon signed-rank test, histogram, and times analysis.
An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed as a solution to the issues of single spatiotemporal feature scale extraction, information redundancy, and insufficient extraction of frequency domain information in channels in 3D convolutional neural networks. Firstly, based on 3D CNN, this paper designs a new multilevel spatiotemporal feature fusion (MSF) structure, which is embedded in the network model. Mainly through multilevel spatiotemporal feature separation, splicing and fusion, it achieves the fusion of spatial perceptual fields and short-medium-long time series information at different scales with reduced network parameters. Secondly, a multi-frequency channel and spatiotemporal attention module (FSAM) is introduced to assign corresponding weights to the different frequency features and spatiotemporal features in the channels, reducing the information redundancy of the feature maps. Finally, we embed the proposed method into the R3D model, which replaces the 2D convolutional filters in the 2D ResNet with 3D convolutional filters, and conduct extensive experimental validation on the small and medium-sized dataset UCF101 and the large-sized dataset Kinetics-400. The findings reveal that our model increases the recognition accuracy on both datasets. Results on the UCF101 dataset, in particular, demonstrate that our model outperforms R3D with a maximum recognition accuracy improvement of 7.2% while using 34.2% fewer parameters. The MSF and FSAM are also migrated to another traditional 3D action recognition model, C3D, for application testing. The test results on UCF101 show that the recognition accuracy is improved by 8.9%, proving the strong generalization ability and universality of the proposed method.
To improve the recognition performance of video human actions, an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement. First, the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features, then the mean-shift clustering algorithm with the adaptive scale kernel is used to label these features. After pooling the features with the same label to generate part-based representation, the visual word responses within one large scale volume are collected as video object representation. On the benchmark KTH (Kungliga Tekniska Högskolan) and UCF (University of Central Florida)-sports action datasets, the experimental results show that the proposed method enhances the representative and discriminative power of action features, and improves recognition rates. Compared with other related literature, the proposed method obtains superior performance.
Human action recognition in complex environments is a challenging task. Recently, sparse representation has achieved excellent results in dealing with the human action recognition problem under different conditions. The main idea of sparse representation classification is to construct a general classification scheme where the training samples of each class can be considered as the dictionary to express the query class, and the minimal reconstruction error indicates its corresponding class. However, how to learn a discriminative dictionary is still a difficult problem. In this work, we make two contributions. First, we build a new and robust human action recognition framework by combining one modified sparse classification model and deep convolutional neural network (CNN) features. Secondly, we construct a novel classification model which consists of the representation-constrained term and the coefficients incoherence term. Experimental results on benchmark datasets show that our modified model can obtain competitive results in comparison to other state-of-the-art models.
Human Action Recognition (HAR) has been an active research topic in machine learning for the last few decades. Visual surveillance, robotics, and pedestrian detection are the main applications for action recognition. Computer vision researchers have introduced many HAR techniques, but they still face challenges such as redundant features and the cost of computing. In this article, we propose a new method for the use of deep learning for HAR. In the proposed method, video frames are initially pre-processed using a global contrast approach and later used to train a deep learning model using domain transfer learning. The Resnet-50 Pre-Trained Model is used as a deep learning model in this work. Features are extracted from two layers: Global Average Pool (GAP) and Fully Connected (FC). The features of both layers are fused by the Canonical Correlation Analysis (CCA). Then features are selected using the Shannon Entropy-based threshold function. The selected features are finally passed to multiple classifiers for final classification. Experiments are conducted on five publicly available datasets: IXMAS, UCF Sports, YouTube, UT-Interaction, and KTH. The accuracy on these datasets was 89.6%, 99.7%, 100%, 96.7% and 96.6%, respectively. Comparison with existing techniques has shown that the proposed method provides improved accuracy for HAR. Also, the proposed method is computationally fast based on the time of execution.
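The abstract above does not define its Shannon entropy-based threshold function. The following minimal sketch assumes one simple variant: estimate each feature column's entropy from a histogram and keep the columns whose entropy is at least the mean entropy. The bin count, the mean-entropy threshold, and the function names are assumptions for illustration and may differ from the paper's method.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=4):
    """Histogram-based Shannon entropy (in bits) of one feature column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value falls in the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_by_entropy(samples):
    """Keep feature indices whose entropy is >= the mean entropy (assumed rule)."""
    columns = list(zip(*samples))
    entropies = [shannon_entropy(col) for col in columns]
    threshold = sum(entropies) / len(entropies)
    keep = [i for i, h in enumerate(entropies) if h >= threshold]
    return keep, entropies


# Toy feature matrix: rows are samples, columns are (fused) features.
samples = [
    [0.0, 1.0, 5.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 5.0],
    [3.0, 1.0, 9.0],
]
keep, entropies = select_by_entropy(samples)
print(keep, entropies)
```

In this toy matrix the constant second column has zero entropy and is dropped, which matches the stated goal of removing redundant features before classification.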
The two-stream convolutional neural network exhibits excellent performance in video action recognition. The crux of the matter is to use the frames already clipped from the videos and the optical flow images pre-extracted from the frames to train a model each, and to finally integrate the outputs of the two models. Nevertheless, the reliance on the pre-extraction of the optical flow impedes the efficiency of action recognition, and the temporal and the spatial streams are just simply fused at the ends, with one stream failing and the other stream succeeding. We propose a novel hidden two-stream collaborative (HTSC) learning network that masks the steps of extracting the optical flow in the network and greatly speeds up the action recognition. Based on the two-stream method, the two-stream collaborative learning model captures the interaction of the temporal and spatial features to greatly enhance the accuracy of recognition. Our proposed method is highly capable of achieving the balance of efficiency and precision on large-scale video action recognition datasets.
In the current era of multimedia information, it is increasingly urgent to realize intelligent video action recognition and content analysis. In the past few years, video action recognition, as an important direction in computer vision, has attracted many researchers and made much progress. First, this paper reviews the latest video action recognition methods based on Deep Neural Network and Markov Logic Network. Second, we analyze the characteristics of each method and the performance from the experiment results. Then, we compare the emphases of these methods and discuss the application scenarios. Finally, we consider the development trend and direction of this field.
In this paper, we propose a novel approach to recognise human activities from a different view. Although appearance-based recognition methods have been shown to be unsuitable for action recognition for varying views, there must be some regularity among the same action sequences of different views. Self-similarity matrices appear to be relatively stable across views. However, the ability to effectively realise this stability is a problem. In this paper, we extract the shape-flow descriptor as the low-level feature and then choose the same number of key frames from the action sequences. Self-similarity matrices are obtained by computing the similarity between any pair of the key frames. The diagonal features of the similarity matrices are extracted as the high-level feature representation of the action sequence, and Support Vector Machines (SVM) are employed for classification. We test our approach on the IXMAS multi-view data set. The proposed approach is simple but effective when compared with other algorithms.
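The self-similarity step described above can be sketched as follows. This is a minimal illustration under stated assumptions: each key frame is a plain list of floats standing in for the shape-flow descriptor, pairwise Euclidean distance stands in for the abstract's unspecified similarity measure, and "diagonal features" is read as the concatenated off-diagonals near the main diagonal. The function names and the `max_offset` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_similarity_matrix(key_frames):
    """n x n matrix of pairwise distances between key-frame descriptors."""
    n = len(key_frames)
    return [[euclidean(key_frames[i], key_frames[j]) for j in range(n)]
            for i in range(n)]


def diagonal_features(ssm, max_offset):
    """Concatenate off-diagonals 1..max_offset above the main diagonal."""
    n = len(ssm)
    feats = []
    for k in range(1, max_offset + 1):
        feats.extend(ssm[i][i + k] for i in range(n - k))
    return feats


# Four toy key-frame descriptors (stand-ins for real shape-flow descriptors).
frames = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [3.0, 4.0]]
ssm = self_similarity_matrix(frames)
features = diagonal_features(ssm, 2)
print(features)
```

Because every sequence is reduced to the same number of key frames, the resulting feature vector has a fixed length, so it can be passed directly to a classifier such as an SVM.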
Abstract: Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers the likelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in video streams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enable instant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing action datasets often lack diversity and specificity for workout actions, hindering the development of accurate recognition models. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significant contribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated to encompass various exercises performed by numerous individuals in different settings. This research proposes an innovative framework based on the Attention-driven Residual Deep Convolutional-Gated Recurrent Unit (ResDC-GRU) network for workout action recognition in video streams. Unlike still images, videos contain spatio-temporal information, making the task more complex and challenging. While substantial progress has been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions, and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attention model demonstrated exceptional classification performance with 95.81% accuracy in classifying workout action videos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and 93.2% accuracy on established benchmark datasets, namely HMDB51, YouTube Actions, UCF50, and UCF101, respectively, showcasing its superiority and robustness in action recognition. The findings suggest practical implications in real-world scenarios where precise video action recognition is paramount, addressing the persisting challenges in the field. The WAVd dataset serves as a catalyst for the development of more robust and effective fitness tracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
Funding: Supported by the Natural Science Foundation of Jiangsu Province (No. BK20230696).
Abstract: Electric power training is essential for ensuring the safety and reliability of the system. In this study, we introduce a novel Abnormal Action Recognition (AAR) system that utilizes a Lightweight Pose Estimation Network (LPEN) to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios. The LPEN network, comprising three stages (MobileNet, Initial Stage, and Refinement Stage), is employed to swiftly extract image features, detect human key points, and refine them for accurate analysis. Subsequently, a Pose-aware Action Analysis Module (PAAM) captures the positional coordinates of human skeletal points in each frame. Finally, an Abnormal Action Inference Module (AAIM) evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring. For fall-down recognition, three criteria are considered: falling speed, main angles of skeletal points, and the person’s bounding box. To identify unauthorized trespass, emphasis is placed on the position of the ankles. Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.
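The three fall-down criteria named in the abstract above (falling speed, skeletal angle, bounding box) can be illustrated with a minimal rule-based sketch. Everything concrete here is an assumption made for illustration: the `PoseFrame` fields, the threshold values, and the choice to require all three criteria at once are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PoseFrame:
    """Hypothetical per-frame summary derived from detected skeletal points."""
    hip_y: float            # vertical hip position in pixels (image y grows downward)
    torso_angle_deg: float  # angle between the torso and the vertical axis
    box_width: float        # bounding-box width in pixels
    box_height: float       # bounding-box height in pixels (assumed > 0)


def is_fall(prev: PoseFrame, curr: PoseFrame, dt: float,
            speed_thresh: float = 200.0,   # pixels per second, assumed value
            angle_thresh: float = 60.0,    # degrees, assumed value
            ratio_thresh: float = 1.0) -> bool:  # width/height, assumed value
    """Return True when all three illustrative criteria indicate a fall."""
    # Criterion 1: the hip moves downward quickly between two frames.
    fast_drop = (curr.hip_y - prev.hip_y) / dt > speed_thresh
    # Criterion 2: the torso is far from upright.
    tilted = curr.torso_angle_deg > angle_thresh
    # Criterion 3: the bounding box is wider than it is tall.
    wide_box = curr.box_width / curr.box_height > ratio_thresh
    return fast_drop and tilted and wide_box


standing = PoseFrame(hip_y=300.0, torso_angle_deg=5.0, box_width=80.0, box_height=200.0)
fallen = PoseFrame(hip_y=400.0, torso_angle_deg=80.0, box_width=200.0, box_height=90.0)
print(is_fall(standing, fallen, dt=0.2))    # fast drop, tilted torso, wide box
print(is_fall(standing, standing, dt=0.2))  # no motion, upright posture
```

Requiring all three conditions keeps the rule conservative; a real system might instead weight or vote over the criteria.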
Funding: Supported in part by the 2023 Key Supported Project of the 14th Five-Year Plan for Education and Science in Hunan Province (No. ND230795).
Abstract: In recent years, skeleton-based action recognition has made great achievements in computer vision. A graph convolutional network (GCN) is effective for action recognition, modelling the human skeleton as a spatio-temporal graph. Most GCNs define the graph topology by the physical relations of the human joints. However, this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs, resulting in a low recognition rate for specific actions with implicit correlation between joint pairs. In addition, existing methods ignore the trend correlation between adjacent frames within an action and context clues, leading to erroneous recognition of actions with similar poses. Therefore, this study proposes a learnable GCN based on behavior dependence, which considers implicit joint correlation by constructing a dynamic learnable graph that extracts the specific behavior dependence of joint pairs. By using the weight relationship between the joint pairs, an adaptive model is constructed. The study also designs a self-attention module to obtain the inter-frame topological relationship for exploring the context of actions. Combining the shared topology and the multi-head self-attention map, the module obtains the context-based clue topology to update the dynamic graph convolution, achieving accurate recognition of different actions with similar poses. Detailed experiments on public datasets demonstrate that the proposed method achieves better results and realizes a higher-quality representation of actions under various evaluation protocols compared to state-of-the-art methods.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62072255.
Abstract: Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual and skeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data, failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility to efficiently process both uniform and disparate input patterns. Thus, in this paper, an attention-enhanced pseudo-3D residual model, called HgaNets, is proposed to address the GAR problem. This model comprises two independent components designed for modeling visual RGB (red, green and blue) images and 3D skeletal heatmaps, respectively. More specifically, each component consists of two main parts: 1) a multi-dimensional attention module for capturing important spatial, temporal and feature information in human gestures; 2) a spatiotemporal convolution module that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then, the output weights of the two components are fused to generate the recognition results. Finally, we conducted experiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy on the four datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, while the inference time is 0.54 s and the parameter count is 2.74M. These findings highlight that the proposed model outperforms other existing approaches in terms of recognition accuracy.
Abstract: In recent years, wearable device-based Human Activity Recognition (HAR) models have received significant attention. Previously developed HAR models use hand-crafted features to recognize human activities, leading to the extraction of only basic features. The images captured by wearable sensors contain advanced features, allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions. Poor lighting and limited sensor capabilities can impact data quality, making the recognition of human actions a challenging task. Unimodal HAR approaches are not suitable in a real-time environment. Therefore, an updated HAR model is developed using multiple types of data and an advanced deep learning approach. Firstly, the required signals and sensor data are accumulated from standard databases. From these signals, the wave features are retrieved. Then the extracted wave features and sensor data are given as the input to recognize the human activity. An Adaptive Hybrid Deep Attentive Network (AHDAN) is developed by incorporating a 1D Convolutional Neural Network (1DCNN) with a Gated Recurrent Unit (GRU) for the human activity recognition process. Additionally, the Enhanced Archerfish Hunting Optimizer (EAHO) is suggested to fine-tune the network parameters for enhancing the recognition process. An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model. The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36, a recall of 95.25, a specificity of 95.48, and a precision of 95.47. The results show that the developed model recognizes human actions effectively while taking less time. Additionally, it reduces computational complexity and overfitting through the use of an optimization approach.
Funding: Supported by the Collabo R&D between Industry, Academy, and Research Institute (S3250534) funded by the Ministry of SMEs and Startups (MSS, Korea); the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218176); and the Soonchunhyang University Research Fund.
Abstract: Human Action Recognition (HAR) in uncontrolled environments aims to recognize different actions from a video. An effective HAR model can be employed in applications such as human-computer interaction, health care, person tracking, and video surveillance. Machine Learning (ML) approaches, specifically Convolutional Neural Network (CNN) models, have been widely used and have achieved impressive results through feature fusion. The accuracy and effectiveness of these models continue to be the biggest challenge in this field. In this article, a novel feature optimization algorithm, called improved Shark Smell Optimization (iSSO), is proposed to reduce the redundancy of extracted features. The proposed technique is inspired by the behavior of white sharks and how they find the best prey in the whole search space. The proposed iSSO algorithm divides the Feature Vector (FV) into subparts, where a search is conducted to find optimal local features from each subpart of the FV. Once the locally optimal features are selected, a global search is conducted to further optimize these features. The proposed iSSO algorithm is employed on nine (9) selected CNN models. These CNN models are selected based on their top-1 and top-5 accuracy in the ImageNet competition. To evaluate the model, two publicly available datasets, UCF-Sports and Hollywood2, are selected.
Abstract: The combination of spatiotemporal videos and essential features can improve the performance of human action recognition (HAR); however, an individual type of feature usually degrades the performance due to similar actions and complex backgrounds. The deep convolutional neural network has improved performance in recent years for several computer vision applications due to its spatial information. This article proposes a new framework for human action recognition in video surveillance, dubbed HybridHR-Net. On a few selected datasets, deep transfer learning is used to train the pre-trained EfficientNet-b0 deep learning model. Bayesian optimization is employed to tune the hyperparameters of the fine-tuned deep model. Instead of fully connected layer features, we considered the average pooling layer features and performed two feature selection techniques: an improved artificial bee colony and an entropy-based approach. Using a serial approach, the selected features are combined into a single vector, which is then classified by machine learning classifiers. Five publicly accessible datasets have been utilized in the experiments, yielding notable accuracies of 97%, 98.7%, 100%, 99.7%, and 96.8%, respectively. Additionally, a comparison of the proposed framework with contemporary methods is done to demonstrate the increase in accuracy.
Abstract: The ever-growing available visual data (i.e., videos and pictures uploaded by internet users) has attracted the research community’s attention in the computer vision field. Therefore, finding efficient solutions to extract knowledge from these sources is imperative. Recently, the BlazePose system has been released for skeleton extraction from images, oriented to mobile devices. With this skeleton graph representation in place, a Spatial-Temporal Graph Convolutional Network can be implemented to predict the action. We hypothesize that just by changing the skeleton input data for a different set of joints that offers more information about the action of interest, it is possible to increase the performance of the Spatial-Temporal Graph Convolutional Network for HAR tasks. Hence, in this study, we present the first implementation of the BlazePose skeleton topology upon this architecture for action recognition. Moreover, we propose the Enhanced-BlazePose topology that can achieve better results than its predecessor. Additionally, we propose different skeleton detection thresholds that can improve the accuracy even further. We reached a top-1 accuracy of 40.1% on the Kinetics dataset. For the NTU-RGB+D dataset, we achieved 87.59% and 92.1% accuracy for the Cross-Subject and Cross-View evaluation criteria, respectively.
Funding: Supported by the research team of Xi’an Traffic Engineering Institute and the Young and Middle-aged Fund Project of Xi’an Traffic Engineering Institute (2022KY-02).
Abstract: Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recognition. Previous action recognition methods utilize a fixed spatiotemporal window to learn local video representation. However, these methods fail to capture complex motion patterns due to their limited receptive field. To solve these problems, this paper proposes a lightweight Temporal Pyramid Excitation (TPE) module to capture the short-, medium-, and long-term temporal context. In this method, the Temporal Pyramid (TP) module effectively expands the temporal receptive field of the network by using multi-temporal kernel decomposition without significantly increasing the computational cost. In addition, the Multi Excitation module emphasizes temporal importance to enhance temporal feature representation learning. TPE can be integrated into ResNet50 to build a compact video learning framework, TPENet. Extensive validation experiments on several challenging benchmark datasets (Something-Something V1, Something-Something V2, UCF-101, and HMDB51) demonstrate that our method achieves a preferable balance between computation and accuracy.
Funding: Supported by the National Educational Science 13th Five-Year Plan Project (JYKYB2019012); the Basic Research Fund for the Engineering University of PAP (WJY201907); and the Basic Research Fund of the Engineering University of PAP (WJY202120).
Abstract: Action recognition and detection is an important research topic in computer vision, which can be divided into action recognition and action detection. At present, the distinction between action recognition and action detection is not clear, and the relevant reviews are not comprehensive. Thus, this paper summarizes deep learning-based action recognition and detection methods and datasets to accurately present the research status in this field. Firstly, according to how temporal and spatial features are extracted, the commonly used action recognition models are divided into two-stream models, temporal models, spatiotemporal models, and transformer models. The paper briefly analyzes the characteristics of the four model types and reports the accuracy of various algorithms on common datasets. Then, from the perspective of the tasks to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and commonly used datasets are introduced. Various temporal action detection algorithms are reviewed from the perspectives of two-stage and one-stage methods, and the various spatiotemporal action detection algorithms are summarized in detail. Finally, the relationship between the different parts of action recognition and detection is discussed, the difficulties faced by current research are summarized in detail, and future developments are considered.
Funding: This research work is supported in part by Chiang Mai University and HITEC University.
Abstract: Human action recognition (HAR) based on artificial intelligence reasoning is one of the most important research areas in computer vision. Big breakthroughs in this field have been observed in the last few years; additionally, research interest in this field is evolving toward topics such as understanding of actions and scenes, studying human joints, and human posture recognition. Many HAR techniques have been introduced in the literature. Nonetheless, the challenge of redundant and irrelevant features reduces recognition accuracy. These techniques also face other challenges, such as differing perspectives, environmental conditions, and temporal variations. In this work, a framework based on deep learning and an improved whale optimization algorithm is proposed for HAR. The proposed framework consists of a few core stages, i.e., initial preprocessing of frames, fine-tuning of pre-trained deep learning models through transfer learning (TL), feature fusion using a modified serial-based approach, and improved whale optimization-based selection of the best features for final classification. Two pre-trained deep learning models, InceptionV3 and ResNet101, are fine-tuned, and TL is employed to train them on action recognition datasets. The fusion process increases the length of the feature vectors; therefore, an improved whale optimization algorithm is proposed to select the best features. The best selected features are finally classified using machine learning (ML) classifiers. Four publicly accessible datasets, UT-Interaction, Hollywood, Free Viewpoint Action Recognition using Motion History Volumes (IXMAS), and University of Central Florida (UCF) Sports, are employed, achieving testing accuracies of 100%, 99.9%, 99.1%, and 100%, respectively. Compared with state-of-the-art (SOTA) techniques, the proposed method shows improved accuracy.
Funding: Supported by the “Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resources from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600090).
Abstract: Human action recognition (HAR) attempts to understand a subject’s behavior and assign a label to each action performed. It is appealing because it has a wide range of applications in computer vision, such as video surveillance and smart cities. Many attempts have been made in the literature to develop an effective and robust framework for HAR. Still, the process remains difficult and may result in reduced accuracy due to several challenges, such as similarity among actions, extraction of essential features, and reduction of irrelevant features. In this work, we propose an end-to-end framework using deep learning and an improved tree seed optimization algorithm for accurate HAR. The proposed design consists of a few significant steps. In the first step, frame preprocessing is performed. In the second step, two pre-trained deep learning models are fine-tuned and trained through deep transfer learning using preprocessed video frames. In the next step, the deep learning features of both fine-tuned models are fused using a new Parallel Standard Deviation Padding Max Value approach. The fused features are further optimized using an improved tree seed algorithm, and the best selected features are finally classified using machine learning classifiers. The experiment was carried out on five publicly available datasets, including UT-Interaction, Weizmann, KTH, Hollywood, and IXMAS, and achieved higher accuracy than previous techniques.
Abstract: BlazePose, which models human body skeletons as spatiotemporal graphs, has achieved strong performance in skeleton-based action identification. Skeleton extraction from photos for mobile devices has been made possible by the BlazePose system. A Spatial-Temporal Graph Convolutional Network (STGCN) can then predict the actions. The STGCN can be improved by simply replacing the skeleton input data with a different set of joints that provide more information about the activity of interest. On the other hand, existing approaches require the user to manually set the graph’s topology and then fix it across all input layers and samples. This research shows how to use the Stochastic Fractal Search (SFS)-Guided Whale Optimization Algorithm (GWOA). To get the best solution for the GWOA, we adopt the SFS diffusion algorithm, which uses the random walk with a Gaussian distribution common to growing systems. Continuous values are transformed into binary values to apply to the feature-selection problem, in conjunction with the BlazePose skeletal topology and stochastic fractal search, to construct a novel implementation of the BlazePose topology for action recognition. In our experiments, we employed the Kinetics and NTU-RGB+D datasets. The achieved action accuracy is 93.14% in the X-View setting and 96.74% in the X-Sub setting. In addition, the proposed model performs better in numerous statistical tests, such as the Analysis of Variance (ANOVA), Wilcoxon signed-rank test, histogram, and time analysis.
Funding: Supported by the General Program of the National Natural Science Foundation of China (62272234); the Enterprise Cooperation Project (2022h160); and the Priority Academic Program Development of Jiangsu Higher Education Institutions Project.
Abstract: An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed as a solution to the issues of single-scale spatiotemporal feature extraction, information redundancy, and insufficient extraction of frequency-domain information in channels in 3D convolutional neural networks. Firstly, based on the 3D CNN, this paper designs a new multilevel spatiotemporal feature fusion (MSF) structure, which is embedded in the network model. Mainly through multilevel spatiotemporal feature separation, splicing, and fusion, it fuses spatial perceptual fields and short-, medium-, and long-term time series information at different scales with reduced network parameters. Secondly, a multi-frequency channel and spatiotemporal attention module (FSAM) is introduced to assign corresponding weights to the different frequency features and spatiotemporal features in the channels, reducing the information redundancy of the feature maps. Finally, we embed the proposed method into the R3D model, which replaces the 2D convolutional filters of the 2D ResNet with 3D convolutional filters, and conduct extensive experimental validation on the small- and medium-sized dataset UCF101 and the large-sized dataset Kinetics-400. The findings reveal that our model increases the recognition accuracy on both datasets. Results on the UCF101 dataset, in particular, demonstrate that our model improves on the recognition accuracy of R3D by up to 7.2% while using 34.2% fewer parameters. The MSF and FSAM are also migrated to another traditional 3D action recognition model, C3D, for application testing. The test results on UCF101 show that the recognition accuracy is improved by 8.9%, demonstrating the strong generalization ability and universality of the method in this paper.
Funding: The National Natural Science Foundation of China (No. 60971098, 61201345).
Abstract: To improve the recognition performance of video human actions, an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement. First, the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features; then the mean-shift clustering algorithm with an adaptive scale kernel is used to label these features. After pooling the features with the same label to generate a part-based representation, the visual word responses within one large-scale volume are collected as the video object representation. On the benchmark KTH (Kungliga Tekniska Högskolan) and UCF (University of Central Florida) sports action datasets, the experimental results show that the proposed method enhances the representative and discriminative power of action features and improves recognition rates. Compared with other related literature, the proposed method obtains superior performance.
Funding: This research was funded by the National Natural Science Foundation of China (21878124, 31771680 and 61773182).
Abstract: Human action recognition in complex environments is a challenging task. Recently, sparse representation has achieved excellent results in human action recognition under different conditions. The main idea of sparse representation classification is to construct a general classification scheme in which the training samples of each class serve as the dictionary to express the query sample, and the minimal reconstruction error indicates its corresponding class. However, learning a discriminative dictionary remains difficult. In this work, we make two contributions. First, we build a new and robust human action recognition framework by combining a modified sparse classification model with deep convolutional neural network (CNN) features. Second, we construct a novel classification model which consists of a representation-constrained term and a coefficient incoherence term. Experimental results on benchmark datasets show that our modified model obtains competitive results in comparison to other state-of-the-art models.
Funding: This research was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0012724, The Competency Development Program for Industry Specialist) and the Soonchunhyang University Research Fund.
Abstract: Human Action Recognition (HAR) has been an active research topic in machine learning for the last few decades. Visual surveillance, robotics, and pedestrian detection are the main applications of action recognition. Computer vision researchers have introduced many HAR techniques, but they still face challenges such as redundant features and computational cost. In this article, we propose a new deep learning-based method for HAR. In the proposed method, video frames are initially pre-processed using a global contrast approach and later used to train a deep learning model using domain transfer learning. The pre-trained ResNet-50 model is used as the deep learning model in this work. Features are extracted from two layers: Global Average Pool (GAP) and Fully Connected (FC). The features of both layers are fused by Canonical Correlation Analysis (CCA). Then features are selected using a Shannon entropy-based threshold function. The selected features are finally passed to multiple classifiers for final classification. Experiments are conducted on five publicly available datasets: IXMAS, UCF Sports, YouTube, UT-Interaction, and KTH. The accuracy on these datasets was 89.6%, 99.7%, 100%, 96.7%, and 96.6%, respectively. Comparison with existing techniques has shown that the proposed method provides improved accuracy for HAR. The proposed method is also computationally fast in terms of execution time.
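The entropy-based selection step mentioned in the abstract above can be sketched as follows. This is only one plausible reading: the histogram binning, the use of the mean entropy as the threshold, and the function names `shannon_entropy` and `select_features` are assumptions for illustration, since the abstract does not specify the threshold function.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=4):
    """Shannon entropy (in bits) of one feature column after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value falls into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_features(samples):
    """Keep the indices of features whose entropy is at least the mean entropy.

    `samples` is a list of equal-length feature vectors (one per video).
    The mean-entropy threshold is an assumed choice, not the paper's.
    """
    columns = list(zip(*samples))
    entropies = [shannon_entropy(col) for col in columns]
    threshold = sum(entropies) / len(entropies)
    return [i for i, h in enumerate(entropies) if h >= threshold]


samples = [
    [0.0, 1.0, 5.0],
    [0.0, 2.0, 5.0],
    [0.0, 3.0, 5.1],
    [0.0, 4.0, 5.0],
]
print(select_features(samples))  # indices of the retained feature columns
```

In this toy input the first column is constant and the third barely varies, so only the evenly spread second column clears the threshold.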
Funding: This work was supported by the Scientific Research Fund of Hunan Provincial Education Department of China (Project No. 17A007) and the Teaching Reform and Research Project of Hunan Province of China (Project No. JG1615).
Abstract: The two-stream convolutional neural network exhibits excellent performance in video action recognition. Its core idea is to train one model on frames clipped from the videos and another on optical flow images pre-extracted from those frames, and to finally integrate the outputs of the two models. Nevertheless, the reliance on pre-extraction of the optical flow impedes the efficiency of action recognition, and the temporal and spatial streams are simply fused at the ends, so one stream may fail while the other succeeds. We propose a novel hidden two-stream collaborative (HTSC) learning network that hides the optical flow extraction steps inside the network and greatly speeds up action recognition. Based on the two-stream method, the two-stream collaborative learning model captures the interaction of the temporal and spatial features to greatly enhance recognition accuracy. Our proposed method achieves a good balance of efficiency and precision on large-scale video action recognition datasets.
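The conventional late fusion that the abstract above criticizes (two separately trained streams whose outputs are merged only at the end) can be sketched in a few lines. The logits, the equal weighting, and the function names are illustrative assumptions; this is the baseline scheme, not the proposed HTSC network.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' class probabilities.

    Returns the predicted class index and the fused probability vector.
    The default equal weighting is an assumed value.
    """
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    fused = [(1.0 - temporal_weight) * s + temporal_weight * t
             for s, t in zip(p_spatial, p_temporal)]
    return fused.index(max(fused)), fused


# Hypothetical per-class scores from an RGB-frame model and an optical-flow model.
label, probs = late_fusion([2.0, 1.0, 0.1], [0.5, 3.0, 0.2])
print(label, [round(p, 3) for p in probs])
```

Because the streams interact only through this final average, a confident error in one stream can override a correct but less confident prediction in the other, which is the weakness the abstract points to.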
Funding: This work was supported in part by the National Science Foundation Project of P.R. China (Grant Nos. 61503424, 61331013).
Abstract: In the current era of multimedia information, it is increasingly urgent to realize intelligent video action recognition and content analysis. In the past few years, video action recognition, as an important direction in computer vision, has attracted many researchers and made much progress. First, this paper reviews the latest video action recognition methods based on Deep Neural Networks and Markov Logic Networks. Second, we analyze the characteristics of each method and its performance based on the experimental results. Then, we compare the emphases of these methods and discuss their application scenarios. Finally, we consider the development trends and future directions of this field.
Funding: Supported by a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (Information and Communication Engineering); the Natural Science Foundation of Jiangsu Province under Grant No. BK2010523; the National Natural Science Foundation of China under Grants No. 61172118 and No. 61001152; the University Natural Science Research Project of Jiangsu Province under Grant No. 11KJB510012; and the Scientific Research Foundation of Nanjing University of Posts and Telecommunications under Grant No. NY210073.
Abstract: In this paper, we propose a novel approach to recognise human activities from different views. Although appearance-based recognition methods have been shown to be unsuitable for action recognition across varying views, there must be some regularity among the same action sequences seen from different views. Self-similarity matrices appear to be relatively stable across views. However, effectively exploiting this stability is a problem. In this paper, we extract the shape-flow descriptor as the low-level feature and then choose the same number of key frames from each action sequence. Self-similarity matrices are obtained by computing the similarity between every pair of key frames. The diagonal features of the similarity matrices are extracted as the high-level feature representation of the action sequence, and a Support Vector Machine (SVM) is employed for classification. We test our approach on the IXMAS multi-view data set. The proposed approach is simple but effective when compared with other algorithms.
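The self-similarity step described in the abstract above can be illustrated with a small sketch. Several details are assumptions for illustration: Euclidean distance stands in for the unspecified similarity measure, short tuples stand in for shape-flow descriptors, and the "diagonal features" are read as the entries on the first few off-diagonals of the matrix.

```python
import math


def self_similarity_matrix(frames):
    """Pairwise Euclidean distances between key-frame descriptors."""
    return [[math.dist(a, b) for b in frames] for a in frames]


def diagonal_features(ssm, max_offset=2):
    """Concatenate the entries of the first `max_offset` off-diagonals.

    Offset k compares each key frame with the one k steps later, so the
    result summarizes how the sequence changes at several temporal lags.
    """
    n = len(ssm)
    feats = []
    for k in range(1, max_offset + 1):
        feats.extend(ssm[i][i + k] for i in range(n - k))
    return feats


# Three hypothetical 2-D key-frame descriptors.
key_frames = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ssm = self_similarity_matrix(key_frames)
print(diagonal_features(ssm))  # → [5.0, 5.0, 10.0]
```

Choosing the same number of key frames for every sequence, as the abstract notes, makes these feature vectors the same length, which is what a classifier such as an SVM expects.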