Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions i...Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.展开更多
Electric power training is essential for ensuring the safety and reliability of the system.In this study,we introduce a novel Abnormal Action Recognition(AAR)system that utilizes a Lightweight Pose Estimation Network(...Electric power training is essential for ensuring the safety and reliability of the system.In this study,we introduce a novel Abnormal Action Recognition(AAR)system that utilizes a Lightweight Pose Estimation Network(LPEN)to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios.The LPEN network,comprising three stages—MobileNet,Initial Stage,and Refinement Stage—is employed to swiftly extract image features,detect human key points,and refine them for accurate analysis.Subsequently,a Pose-aware Action Analysis Module(PAAM)captures the positional coordinates of human skeletal points in each frame.Finally,an Abnormal Action Inference Module(AAIM)evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring.For fall-down recognition,three criteria—falling speed,main angles of skeletal points,and the person’s bounding box—are considered.To identify unauthorized trespass,emphasis is placed on the position of the ankles.Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.展开更多
In recent years,skeleton-based action recognition has made great achievements in Computer Vision.A graph convolutional network(GCN)is effective for action recognition,modelling the human skeleton as a spatio-temporal ...In recent years,skeleton-based action recognition has made great achievements in Computer Vision.A graph convolutional network(GCN)is effective for action recognition,modelling the human skeleton as a spatio-temporal graph.Most GCNs define the graph topology by physical relations of the human joints.However,this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs,resulting in a low recognition rate for specific actions with implicit correlation between joint pairs.In addition,existing methods ignore the trend correlation between adjacent frames within an action and context clues,leading to erroneous action recognition with similar poses.Therefore,this study proposes a learnable GCN based on behavior dependence,which considers implicit joint correlation by constructing a dynamic learnable graph with extraction of specific behavior dependence of joint pairs.By using the weight relationship between the joint pairs,an adaptive model is constructed.It also designs a self-attention module to obtain their inter-frame topological relationship for exploring the context of actions.Combining the shared topology and the multi-head self-attention map,the module obtains the context-based clue topology to update the dynamic graph convolution,achieving accurate recognition of different actions with similar poses.Detailed experiments on public datasets demonstrate that the proposed method achieves better results and realizes higher quality representation of actions under various evaluation protocols compared to state-of-the-art methods.展开更多
Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual andskeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data...Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual andskeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data,failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility toefficiently process both uniformand disparate input patterns.Thus, in this paper, an attention-enhanced pseudo-3Dresidual model is proposed to address the GAR problem, called HgaNets. This model comprises two independentcomponents designed formodeling visual RGB (red, green and blue) images and 3Dskeletal heatmaps, respectively.More specifically, each component consists of two main parts: 1) a multi-dimensional attention module forcapturing important spatial, temporal and feature information in human gestures;2) a spatiotemporal convolutionmodule that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then,the output weights of the two components are fused to generate the recognition results. Finally, we conductedexperiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy onfour datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, as well as the inference time is 0.54 s andthe parameters is 2.74M. These findings highlight that the proposed model outperforms other existing approachesin terms of recognition accuracy.展开更多
In recent years,wearable devices-based Human Activity Recognition(HAR)models have received significant attention.Previously developed HAR models use hand-crafted features to recognize human activities,leading to the e...In recent years,wearable devices-based Human Activity Recognition(HAR)models have received significant attention.Previously developed HAR models use hand-crafted features to recognize human activities,leading to the extraction of basic features.The images captured by wearable sensors contain advanced features,allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions.Poor lighting and limited sensor capabilities can impact data quality,making the recognition of human actions a challenging task.The unimodal-based HAR approaches are not suitable in a real-time environment.Therefore,an updated HAR model is developed using multiple types of data and an advanced deep-learning approach.Firstly,the required signals and sensor data are accumulated from the standard databases.From these signals,the wave features are retrieved.Then the extracted wave features and sensor data are given as the input to recognize the human activity.An Adaptive Hybrid Deep Attentive Network(AHDAN)is developed by incorporating a“1D Convolutional Neural Network(1DCNN)”with a“Gated Recurrent Unit(GRU)”for the human activity recognition process.Additionally,the Enhanced Archerfish Hunting Optimizer(EAHO)is suggested to fine-tune the network parameters for enhancing the recognition process.An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model.The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36,95.25 for recall,95.48 for specificity,and 95.47 for precision,respectively.The result proved that the developed model is effective in recognizing human action by taking less time.Additionally,it reduces the computation complexity and overfitting issue through using an optimization approach.展开更多
Human Action Recognition(HAR)in uncontrolled environments targets to recognition of different actions froma video.An effective HAR model can be employed for an application like human-computer interaction,health care,p...Human Action Recognition(HAR)in uncontrolled environments targets to recognition of different actions froma video.An effective HAR model can be employed for an application like human-computer interaction,health care,person tracking,and video surveillance.Machine Learning(ML)approaches,specifically,Convolutional Neural Network(CNN)models had beenwidely used and achieved impressive results through feature fusion.The accuracy and effectiveness of these models continue to be the biggest challenge in this field.In this article,a novel feature optimization algorithm,called improved Shark Smell Optimization(iSSO)is proposed to reduce the redundancy of extracted features.This proposed technique is inspired by the behavior ofwhite sharks,and howthey find the best prey in thewhole search space.The proposed iSSOalgorithmdivides the FeatureVector(FV)into subparts,where a search is conducted to find optimal local features fromeach subpart of FV.Once local optimal features are selected,a global search is conducted to further optimize these features.The proposed iSSO algorithm is employed on nine(9)selected CNN models.These CNN models are selected based on their top-1 and top-5 accuracy in ImageNet competition.To evaluate the model,two publicly available datasets UCF-Sports and Hollywood2 are selected.展开更多
The combination of spatiotemporal videos and essential features can improve the performance of human action recognition(HAR);however,the individual type of features usually degrades the performance due to similar acti...The combination of spatiotemporal videos and essential features can improve the performance of human action recognition(HAR);however,the individual type of features usually degrades the performance due to similar actions and complex backgrounds.The deep convolutional neural network has improved performance in recent years for several computer vision applications due to its spatial information.This article proposes a new framework called for video surveillance human action recognition dubbed HybridHR-Net.On a few selected datasets,deep transfer learning is used to pre-trained the EfficientNet-b0 deep learning model.Bayesian optimization is employed for the tuning of hyperparameters of the fine-tuned deep model.Instead of fully connected layer features,we considered the average pooling layer features and performed two feature selection techniques-an improved artificial bee colony and an entropy-based approach.Using a serial nature technique,the features that were selected are combined into a single vector,and then the results are categorized by machine learning classifiers.Five publically accessible datasets have been utilized for the experimental approach and obtained notable accuracy of 97%,98.7%,100%,99.7%,and 96.8%,respectively.Additionally,a comparison of the proposed framework with contemporarymethods is done to demonstrate the increase in accuracy.展开更多
Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recog-nition.Previous action recognition methods utilize a fixed spatiotemporal windo...Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recog-nition.Previous action recognition methods utilize a fixed spatiotemporal window to learn local video representation.However,these methods failed to capture complex motion patterns due to their limited receptive field.To solve the above problems,this paper proposes a lightweight Temporal Pyramid Excitation(TPE)module to capture the short,medium,and long-term temporal context.In this method,Temporal Pyramid(TP)module can effectively expand the temporal receptive field of the network by using the multi-temporal kernel decomposition without significantly increasing the computational cost.In addition,the Multi Excitation module can emphasize temporal importance to enhance the temporal feature representation learning.TPE can be integrated into ResNet50,and building a compact video learning framework-TPENet.Extensive validation experiments on several challenging benchmark(Something-Something V1,Something-Something V2,UCF-101,and HMDB51)datasets demonstrate that our method achieves a preferable balance between computation and accuracy.展开更多
Action recognition and detection is an important research topic in computer vision,which can be divided into action recognition and action detection.At present,the distinction between action recognition and action det...Action recognition and detection is an important research topic in computer vision,which can be divided into action recognition and action detection.At present,the distinction between action recognition and action detection is not clear,and the relevant reviews are not comprehensive.Thus,this paper summarized the action recognition and detection methods and datasets based on deep learning to accurately present the research status in this field.Firstly,according to the way that temporal and spatial features are extracted from the model,the commonly used models of action recognition are divided into the two stream models,the temporal models,the spatiotemporal models and the transformer models according to the architecture.And this paper briefly analyzes the characteristics of the four models and introduces the accuracy of various algorithms in common data sets.Then,from the perspective of tasks to be completed,action detection is further divided into temporal action detection and spatiotemporal action detection,and commonly used datasets are introduced.From the perspectives of the twostage method and one-stage method,various algorithms of temporal action detection are reviewed,and the various algorithms of spatiotemporal action detection are summarized in detail.Finally,the relationship between different parts of action recognition and detection is discussed,the difficulties faced by the current research are summarized in detail,and future development was prospected。展开更多
Human action recognition(HAR)attempts to understand a subject’sbehavior and assign a label to each action performed.It is more appealingbecause it has a wide range of applications in computer vision,such asvideo surv...Human action recognition(HAR)attempts to understand a subject’sbehavior and assign a label to each action performed.It is more appealingbecause it has a wide range of applications in computer vision,such asvideo surveillance and smart cities.Many attempts have been made in theliterature to develop an effective and robust framework for HAR.Still,theprocess remains difficult and may result in reduced accuracy due to severalchallenges,such as similarity among actions,extraction of essential features,and reduction of irrelevant features.In this work,we proposed an end-toendframework using deep learning and an improved tree seed optimizationalgorithm for accurate HAR.The proposed design consists of a fewsignificantsteps.In the first step,frame preprocessing is performed.In the second step,two pre-trained deep learning models are fine-tuned and trained throughdeep transfer learning using preprocessed video frames.In the next step,deeplearning features of both fine-tuned models are fused using a new ParallelStandard Deviation Padding Max Value approach.The fused features arefurther optimized using an improved tree seed algorithm,and select the bestfeatures are finally classified by using the machine learning classifiers.Theexperiment was carried out on five publicly available datasets,including UTInteraction,Weizmann,KTH,Hollywood,and IXAMS,and achieved higheraccuracy than previous techniques.展开更多
The development of artificial intelligence(AI)and smart home technologies has driven the need for speech recognition-based solutions.This demand stems from the quest for more intuitive and natural interaction between ...The development of artificial intelligence(AI)and smart home technologies has driven the need for speech recognition-based solutions.This demand stems from the quest for more intuitive and natural interaction between users and smart devices in their homes.Speech recognition allows users to control devices and perform everyday actions through spoken commands,eliminating the need for physical interfaces or touch screens and enabling specific tasks such as turning on or off the light,heating,or lowering the blinds.The purpose of this study is to develop a speech-based classification model for recognizing human actions in the smart home.It seeks to demonstrate the effectiveness and feasibility of using machine learning techniques in predicting categories,subcategories,and actions from sentences.A dataset labeled with relevant information about categories,subcategories,and actions related to human actions in the smart home is used.The methodology uses machine learning techniques implemented in Python,extracting features using CountVectorizer to convert sentences into numerical representations.The results show that the classification model is able to accurately predict categories,subcategories,and actions based on sentences,with 82.99%accuracy for category,76.19%accuracy for subcategory,and 90.28%accuracy for action.The study concludes that using machine learning techniques is effective for recognizing and classifying human actions in the smart home,supporting its feasibility in various scenarios and opening new possibilities for advanced natural language processing systems in the field of AI and smart homes.展开更多
The BlazePose,which models human body skeletons as spatiotem-poral graphs,has achieved fantastic performance in skeleton-based action identification.Skeleton extraction from photos for mobile devices has been made pos...The BlazePose,which models human body skeletons as spatiotem-poral graphs,has achieved fantastic performance in skeleton-based action identification.Skeleton extraction from photos for mobile devices has been made possible by the BlazePose system.A Spatial-Temporal Graph Con-volutional Network(STGCN)can then forecast the actions.The Spatial-Temporal Graph Convolutional Network(STGCN)can be improved by simply replacing the skeleton input data with a different set of joints that provide more information about the activity of interest.On the other hand,existing approaches require the user to manually set the graph’s topology and then fix it across all input layers and samples.This research shows how to use the Statistical Fractal Search(SFS)-Guided whale optimization algorithm(GWOA).To get the best solution for the GWOA,we adopt the SFS diffusion algorithm,which uses the random walk with a Gaussian distribution method common to growing systems.Continuous values are transformed into binary to apply to the feature-selection problem in conjunction with the BlazePose skeletal topology and stochastic fractal search to construct a novel implementation of the BlazePose topology for action recognition.In our experiments,we employed the Kinetics and the NTU-RGB+D datasets.The achieved actiona accuracy in the X-View is 93.14%and in the X-Sub is 96.74%.In addition,the proposed model performs better in numerous statistical tests such as the Analysis of Variance(ANOVA),Wilcoxon signed-rank test,histogram,and times analysis.展开更多
Sign language fills the communication gap for people with hearing and speaking ailments.It includes both visual modalities,manual gestures consisting of movements of hands,and non-manual gestures incorporating body mo...Sign language fills the communication gap for people with hearing and speaking ailments.It includes both visual modalities,manual gestures consisting of movements of hands,and non-manual gestures incorporating body movements including head,facial expressions,eyes,shoulder shrugging,etc.Previously both gestures have been detected;identifying separately may have better accuracy,butmuch communicational information is lost.Aproper sign language mechanism is needed to detect manual and non-manual gestures to convey the appropriate detailed message to others.Our novel proposed system contributes as Sign LanguageAction Transformer Network(SLATN),localizing hand,body,and facial gestures in video sequences.Here we are expending a Transformer-style structural design as a“base network”to extract features from a spatiotemporal domain.Themodel impulsively learns to track individual persons and their action context inmultiple frames.Furthermore,a“head network”emphasizes hand movement and facial expression simultaneously,which is often crucial to understanding sign language,using its attention mechanism for creating tight bounding boxes around classified gestures.The model’s work is later compared with the traditional identification methods of activity recognition.It not only works faster but achieves better accuracy as well.Themodel achieves overall 82.66%testing accuracy with a very considerable performance of computation with 94.13 Giga-Floating Point Operations per Second(G-FLOPS).Another contribution is a newly created dataset of Pakistan Sign Language forManual and Non-Manual(PkSLMNM)gestures.展开更多
The ever-growing available visual data(i.e.,uploaded videos and pictures by internet users)has attracted the research community’s attention in the computer vision field.Therefore,finding efficient solutions to extrac...The ever-growing available visual data(i.e.,uploaded videos and pictures by internet users)has attracted the research community’s attention in the computer vision field.Therefore,finding efficient solutions to extract knowledge from these sources is imperative.Recently,the BlazePose system has been released for skeleton extraction from images oriented to mobile devices.With this skeleton graph representation in place,a Spatial-Temporal Graph Convolutional Network can be implemented to predict the action.We hypothesize that just by changing the skeleton input data for a different set of joints that offers more information about the action of interest,it is possible to increase the performance of the Spatial-Temporal Graph Convolutional Network for HAR tasks.Hence,in this study,we present the first implementation of the BlazePose skeleton topology upon this architecture for action recognition.Moreover,we propose the Enhanced-BlazePose topology that can achieve better results than its predecessor.Additionally,we propose different skeleton detection thresholds that can improve the accuracy performance even further.We reached a top-1 accuracy performance of 40.1%on the Kinetics dataset.For the NTU-RGB+D dataset,we achieved 87.59%and 92.1%accuracy for Cross-Subject and Cross-View evaluation criteria,respectively.展开更多
Human action recognition(HAR)based on Artificial intelligence reasoning is the most important research area in computer vision.Big breakthroughs in this field have been observed in the last few years;additionally,the ...Human action recognition(HAR)based on Artificial intelligence reasoning is the most important research area in computer vision.Big breakthroughs in this field have been observed in the last few years;additionally,the interest in research in this field is evolving,such as understanding of actions and scenes,studying human joints,and human posture recognition.Many HAR techniques are introduced in the literature.Nonetheless,the challenge of redundant and irrelevant features reduces recognition accuracy.They also faced a few other challenges,such as differing perspectives,environmental conditions,and temporal variations,among others.In this work,a deep learning and improved whale optimization algorithm based framework is proposed for HAR.The proposed framework consists of a few core stages i.e.,frames initial preprocessing,fine-tuned pre-trained deep learning models through transfer learning(TL),features fusion using modified serial based approach,and improved whale optimization based best features selection for final classification.Two pre-trained deep learning models such as InceptionV3 and Resnet101 are fine-tuned and TL is employed to train on action recognition datasets.The fusion process increases the length of feature vectors;therefore,improved whale optimization algorithm is proposed and selects the best features.The best selected features are finally classified usingmachine learning(ML)classifiers.Four publicly accessible datasets such as Ut-interaction,Hollywood,Free Viewpoint Action Recognition usingMotion History Volumes(IXMAS),and centre of computer vision(UCF)Sports,are employed and achieved the testing accuracy of 100%,99.9%,99.1%,and 100%respectively.Comparison with state of the art techniques(SOTA),the proposed method showed the improved accuracy.展开更多
An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed as a solution to the issues of single spatiotemporal feature scale extraction,information r...An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed as a solution to the issues of single spatiotemporal feature scale extraction,information redundancy,and insufficient extraction of frequency domain information in channels in 3D convolutional neural networks.Firstly,based on 3D CNN,this paper designs a new multilevel spatiotemporal feature fusion(MSF)structure,which is embedded in the network model,mainly through multilevel spatiotemporal feature separation,splicing and fusion,to achieve the fusion of spatial perceptual fields and short-medium-long time series information at different scales with reduced network parameters;In the second step,a multi-frequency channel and spatiotemporal attention module(FSAM)is introduced to assign different frequency features and spatiotemporal features in the channels are assigned corresponding weights to reduce the information redundancy of the feature maps.Finally,we embed the proposed method into the R3D model,which replaced the 2D convolutional filters in the 2D Resnet with 3D convolutional filters and conduct extensive experimental validation on the small and medium-sized dataset UCF101 and the largesized dataset Kinetics-400.The findings revealed that our model increased the recognition accuracy on both datasets.Results on the UCF101 dataset,in particular,demonstrate that our model outperforms R3D in terms of a maximum recognition accuracy improvement of 7.2%while using 34.2%fewer parameters.The MSF and FSAM are migrated to another traditional 3D action recognition model named C3D for application testing.The test results based on UCF101 show that the recognition accuracy is improved by 8.9%,proving the strong generalization ability and universality of the method in this paper.展开更多
Artificial intelligence is increasingly being applied in the field of video analysis,particularly in the area of public safety where video surveillance equipment such as closed-circuit television(CCTV)is used and auto...Artificial intelligence is increasingly being applied in the field of video analysis,particularly in the area of public safety where video surveillance equipment such as closed-circuit television(CCTV)is used and automated analysis of video information is required.However,various issues such as data size limitations and low processing speeds make real-time extraction of video data challenging.Video analysis technology applies object classification,detection,and relationship analysis to continuous 2D frame data,and the various meanings within the video are thus analyzed based on the extracted basic data.Motion recognition is key in this analysis.Motion recognition is a challenging field that analyzes human body movements,requiring the interpretation of complex movements of human joints and the relationships between various objects.The deep learning-based human skeleton detection algorithm is a representative motion recognition algorithm.Recently,motion analysis models such as the SlowFast network algorithm,have also been developed with excellent performance.However,these models do not operate properly in most wide-angle video environments outdoors,displaying low response speed,as expected from motion classification extraction in environments associated with high-resolution images.The proposed method achieves high level of extraction and accuracy by improving SlowFast’s input data preprocessing and data structure methods.The input data are preprocessed through object tracking and background removal using YOLO and DeepSORT.A higher performance than that of a single model is achieved by improving the existing SlowFast’s data structure into a frame unit structure.Based on the confusion matrix,accuracies of 70.16%and 70.74%were obtained for the existing SlowFast and proposed model,respectively,indicating a 0.58%increase in accuracy.Comparing detection,based on behavioral classification,the existing SlowFast detected 2,341,164 cases,whereas the proposed model detected 3,119,323 cases,which is an increase of 33.23%.展开更多
To improve the recognition performance of video human actions,an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-t...To improve the recognition performance of video human actions,an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement.First,the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features,then the mean-shift clustering algorithm with the adaptive scale kernel is used to label these features.After pooling the features with the same label to generate part-based representation,the visual word responses within one large scale volume are collected as video object representation.On the benchmark KTH(Kungliga Tekniska H?gskolan)and UCF (University of Central Florida)-sports action datasets,the experimental results show that the proposed method enhances the representative and discriminative power of action features, and improves recognition rates.Compared with other related literature,the proposed method obtains superior performance.展开更多
Video-based action recognition is becoming a vital tool in clinical research and neuroscientific study for disorder detection and prediction.However,action recognition currently used in non-human primate(NHP)research ...Video-based action recognition is becoming a vital tool in clinical research and neuroscientific study for disorder detection and prediction.However,action recognition currently used in non-human primate(NHP)research relies heavily on intense manual labor and lacks standardized assessment.In this work,we established two standard benchmark datasets of NHPs in the laboratory:Monkeyin Lab(Mi L),which includes 13 categories of actions and postures,and MiL2D,which includes sequences of two-dimensional(2D)skeleton features.Furthermore,based on recent methodological advances in deep learning and skeleton visualization,we introduced the Monkey Monitor Kit(Mon Kit)toolbox for automatic action recognition,posture estimation,and identification of fine motor activity in monkeys.Using the datasets and Mon Kit,we evaluated the daily behaviors of wild-type cynomolgus monkeys within their home cages and experimental environments and compared these observations with the behaviors exhibited by cynomolgus monkeys possessing mutations in the MECP2 gene as a disease model of Rett syndrome(RTT).Mon Kit was used to assess motor function,stereotyped behaviors,and depressive phenotypes,with the outcomes compared with human manual detection.Mon Kit established consistent criteria for identifying behavior in NHPs with high accuracy and efficiency,thus providing a novel and comprehensive tool for assessing phenotypic behavior in monkeys.展开更多
In order to take advantage of the logical structure of video sequences and improve the recognition accuracy of the human action, a novel hybrid human action detection method based on three descriptors and decision lev...In order to take advantage of the logical structure of video sequences and improve the recognition accuracy of the human action, a novel hybrid human action detection method based on three descriptors and decision level fusion is proposed. Firstly, the minimal 3D space region of human action region is detected by combining frame difference method and Vi BE algorithm, and the three-dimensional histogram of oriented gradient(HOG3D) is extracted. At the same time, the characteristics of global descriptors based on frequency domain filtering(FDF) and the local descriptors based on spatial-temporal interest points(STIP) are extracted. Principal component analysis(PCA) is implemented to reduce the dimension of the gradient histogram and the global descriptor, and bag of words(BoW) model is applied to describe the local descriptors based on STIP. Finally, a linear support vector machine(SVM) is used to create a new decision level fusion classifier. Some experiments are done to verify the performance of the multi-features, and the results show that they have good representation ability and generalization ability. Otherwise, the proposed scheme obtains very competitive results on the well-known datasets in terms of mean average precision.展开更多
文摘Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers thelikelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in videostreams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enableinstant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing actiondatasets often lack diversity and specificity for workout actions, hindering the development of accurate recognitionmodels. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significantcontribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated toencompass various exercises performed by numerous individuals in different settings. This research proposes aninnovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDCGRU)network for workout action recognition in video streams. Unlike image-based action recognition, videoscontain spatio-temporal information, making the task more complex and challenging. While substantial progresshas been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions,and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attentionmodel demonstrated exceptional classification performance with 95.81% accuracy in classifying workout actionvideos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101,respectively, showcasing its superiority and robustness in action recognition. The findings suggest practicalimplications in real-world scenarios where precise video action recognition is paramount, addressing the persistingchallenges in the field. TheWAVd dataset serves as a catalyst for the development ofmore robust and effective fitnesstracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
基金supportted by Natural Science Foundation of Jiangsu Province(No.BK20230696).
文摘Electric power training is essential for ensuring the safety and reliability of the system.In this study,we introduce a novel Abnormal Action Recognition(AAR)system that utilizes a Lightweight Pose Estimation Network(LPEN)to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios.The LPEN network,comprising three stages—MobileNet,Initial Stage,and Refinement Stage—is employed to swiftly extract image features,detect human key points,and refine them for accurate analysis.Subsequently,a Pose-aware Action Analysis Module(PAAM)captures the positional coordinates of human skeletal points in each frame.Finally,an Abnormal Action Inference Module(AAIM)evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring.For fall-down recognition,three criteria—falling speed,main angles of skeletal points,and the person’s bounding box—are considered.To identify unauthorized trespass,emphasis is placed on the position of the ankles.Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.
基金supported in part by the 2023 Key Supported Project of the 14th Five Year Plan for Education and Science in Hunan Province with No.ND230795.
文摘In recent years,skeleton-based action recognition has made great achievements in Computer Vision.A graph convolutional network(GCN)is effective for action recognition,modelling the human skeleton as a spatio-temporal graph.Most GCNs define the graph topology by physical relations of the human joints.However,this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs,resulting in a low recognition rate for specific actions with implicit correlation between joint pairs.In addition,existing methods ignore the trend correlation between adjacent frames within an action and context clues,leading to erroneous action recognition with similar poses.Therefore,this study proposes a learnable GCN based on behavior dependence,which considers implicit joint correlation by constructing a dynamic learnable graph with extraction of specific behavior dependence of joint pairs.By using the weight relationship between the joint pairs,an adaptive model is constructed.It also designs a self-attention module to obtain their inter-frame topological relationship for exploring the context of actions.Combining the shared topology and the multi-head self-attention map,the module obtains the context-based clue topology to update the dynamic graph convolution,achieving accurate recognition of different actions with similar poses.Detailed experiments on public datasets demonstrate that the proposed method achieves better results and realizes higher quality representation of actions under various evaluation protocols compared to state-of-the-art methods.
基金the National Natural Science Foundation of China under Grant No.62072255.
文摘Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual andskeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data,failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility toefficiently process both uniformand disparate input patterns.Thus, in this paper, an attention-enhanced pseudo-3Dresidual model is proposed to address the GAR problem, called HgaNets. This model comprises two independentcomponents designed formodeling visual RGB (red, green and blue) images and 3Dskeletal heatmaps, respectively.More specifically, each component consists of two main parts: 1) a multi-dimensional attention module forcapturing important spatial, temporal and feature information in human gestures;2) a spatiotemporal convolutionmodule that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then,the output weights of the two components are fused to generate the recognition results. Finally, we conductedexperiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy onfour datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, as well as the inference time is 0.54 s andthe parameters is 2.74M. These findings highlight that the proposed model outperforms other existing approachesin terms of recognition accuracy.
文摘In recent years,wearable devices-based Human Activity Recognition(HAR)models have received significant attention.Previously developed HAR models use hand-crafted features to recognize human activities,leading to the extraction of basic features.The images captured by wearable sensors contain advanced features,allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions.Poor lighting and limited sensor capabilities can impact data quality,making the recognition of human actions a challenging task.The unimodal-based HAR approaches are not suitable in a real-time environment.Therefore,an updated HAR model is developed using multiple types of data and an advanced deep-learning approach.Firstly,the required signals and sensor data are accumulated from the standard databases.From these signals,the wave features are retrieved.Then the extracted wave features and sensor data are given as the input to recognize the human activity.An Adaptive Hybrid Deep Attentive Network(AHDAN)is developed by incorporating a“1D Convolutional Neural Network(1DCNN)”with a“Gated Recurrent Unit(GRU)”for the human activity recognition process.Additionally,the Enhanced Archerfish Hunting Optimizer(EAHO)is suggested to fine-tune the network parameters for enhancing the recognition process.An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model.The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36,95.25 for recall,95.48 for specificity,and 95.47 for precision,respectively.The result proved that the developed model is effective in recognizing human action by taking less time.Additionally,it reduces the computation complexity and overfitting issue through using an optimization approach.
基金supported by the Collabo R&D between Industry,Academy,and Research Institute(S3250534)funded by the Ministry of SMEs and Startups(MSS,Korea)the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.RS-2023-00218176)the Soonchunhyang University Research Fund.
文摘Human Action Recognition(HAR)in uncontrolled environments targets to recognition of different actions froma video.An effective HAR model can be employed for an application like human-computer interaction,health care,person tracking,and video surveillance.Machine Learning(ML)approaches,specifically,Convolutional Neural Network(CNN)models had beenwidely used and achieved impressive results through feature fusion.The accuracy and effectiveness of these models continue to be the biggest challenge in this field.In this article,a novel feature optimization algorithm,called improved Shark Smell Optimization(iSSO)is proposed to reduce the redundancy of extracted features.This proposed technique is inspired by the behavior ofwhite sharks,and howthey find the best prey in thewhole search space.The proposed iSSOalgorithmdivides the FeatureVector(FV)into subparts,where a search is conducted to find optimal local features fromeach subpart of FV.Once local optimal features are selected,a global search is conducted to further optimize these features.The proposed iSSO algorithm is employed on nine(9)selected CNN models.These CNN models are selected based on their top-1 and top-5 accuracy in ImageNet competition.To evaluate the model,two publicly available datasets UCF-Sports and Hollywood2 are selected.
文摘The combination of spatiotemporal videos and essential features can improve the performance of human action recognition(HAR);however,the individual type of features usually degrades the performance due to similar actions and complex backgrounds.The deep convolutional neural network has improved performance in recent years for several computer vision applications due to its spatial information.This article proposes a new framework called for video surveillance human action recognition dubbed HybridHR-Net.On a few selected datasets,deep transfer learning is used to pre-trained the EfficientNet-b0 deep learning model.Bayesian optimization is employed for the tuning of hyperparameters of the fine-tuned deep model.Instead of fully connected layer features,we considered the average pooling layer features and performed two feature selection techniques-an improved artificial bee colony and an entropy-based approach.Using a serial nature technique,the features that were selected are combined into a single vector,and then the results are categorized by machine learning classifiers.Five publically accessible datasets have been utilized for the experimental approach and obtained notable accuracy of 97%,98.7%,100%,99.7%,and 96.8%,respectively.Additionally,a comparison of the proposed framework with contemporarymethods is done to demonstrate the increase in accuracy.
基金supported by the research team of Xi’an Traffic Engineering Institute and the Young and middle-aged fund project of Xi’an Traffic Engineering Institute (2022KY-02).
文摘Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recog-nition.Previous action recognition methods utilize a fixed spatiotemporal window to learn local video representation.However,these methods failed to capture complex motion patterns due to their limited receptive field.To solve the above problems,this paper proposes a lightweight Temporal Pyramid Excitation(TPE)module to capture the short,medium,and long-term temporal context.In this method,Temporal Pyramid(TP)module can effectively expand the temporal receptive field of the network by using the multi-temporal kernel decomposition without significantly increasing the computational cost.In addition,the Multi Excitation module can emphasize temporal importance to enhance the temporal feature representation learning.TPE can be integrated into ResNet50,and building a compact video learning framework-TPENet.Extensive validation experiments on several challenging benchmark(Something-Something V1,Something-Something V2,UCF-101,and HMDB51)datasets demonstrate that our method achieves a preferable balance between computation and accuracy.
基金supported by the National Educational Science 13th Five-Year Plan Project(JYKYB2019012)the Basic Research Fund for the Engineering University of PAP(WJY201907)the Basic Research Fund of the Engineering University of PAP(WJY202120).
文摘Action recognition and detection is an important research topic in computer vision,which can be divided into action recognition and action detection.At present,the distinction between action recognition and action detection is not clear,and the relevant reviews are not comprehensive.Thus,this paper summarized the action recognition and detection methods and datasets based on deep learning to accurately present the research status in this field.Firstly,according to the way that temporal and spatial features are extracted from the model,the commonly used models of action recognition are divided into the two stream models,the temporal models,the spatiotemporal models and the transformer models according to the architecture.And this paper briefly analyzes the characteristics of the four models and introduces the accuracy of various algorithms in common data sets.Then,from the perspective of tasks to be completed,action detection is further divided into temporal action detection and spatiotemporal action detection,and commonly used datasets are introduced.From the perspectives of the twostage method and one-stage method,various algorithms of temporal action detection are reviewed,and the various algorithms of spatiotemporal action detection are summarized in detail.Finally,the relationship between different parts of action recognition and detection is discussed,the difficulties faced by the current research are summarized in detail,and future development was prospected。
基金supported by“Human Resources Program in Energy Technology”of the Korea Institute of Energy Technology Evaluation and Planning(KETEP),granted financial resources from the Ministry of Trade,Industry&Energy,Republic of Korea.(No.20204010600090).
文摘Human action recognition(HAR)attempts to understand a subject’sbehavior and assign a label to each action performed.It is more appealingbecause it has a wide range of applications in computer vision,such asvideo surveillance and smart cities.Many attempts have been made in theliterature to develop an effective and robust framework for HAR.Still,theprocess remains difficult and may result in reduced accuracy due to severalchallenges,such as similarity among actions,extraction of essential features,and reduction of irrelevant features.In this work,we proposed an end-toendframework using deep learning and an improved tree seed optimizationalgorithm for accurate HAR.The proposed design consists of a fewsignificantsteps.In the first step,frame preprocessing is performed.In the second step,two pre-trained deep learning models are fine-tuned and trained throughdeep transfer learning using preprocessed video frames.In the next step,deeplearning features of both fine-tuned models are fused using a new ParallelStandard Deviation Padding Max Value approach.The fused features arefurther optimized using an improved tree seed algorithm,and select the bestfeatures are finally classified by using the machine learning classifiers.Theexperiment was carried out on five publicly available datasets,including UTInteraction,Weizmann,KTH,Hollywood,and IXAMS,and achieved higheraccuracy than previous techniques.
基金supported by Generalitat Valenciana with HAAS(CIAICO/2021/039)the Spanish Ministry of Science and Innovation under the Project AVANTIA PID2020-114480RB-I00.
文摘The development of artificial intelligence(AI)and smart home technologies has driven the need for speech recognition-based solutions.This demand stems from the quest for more intuitive and natural interaction between users and smart devices in their homes.Speech recognition allows users to control devices and perform everyday actions through spoken commands,eliminating the need for physical interfaces or touch screens and enabling specific tasks such as turning on or off the light,heating,or lowering the blinds.The purpose of this study is to develop a speech-based classification model for recognizing human actions in the smart home.It seeks to demonstrate the effectiveness and feasibility of using machine learning techniques in predicting categories,subcategories,and actions from sentences.A dataset labeled with relevant information about categories,subcategories,and actions related to human actions in the smart home is used.The methodology uses machine learning techniques implemented in Python,extracting features using CountVectorizer to convert sentences into numerical representations.The results show that the classification model is able to accurately predict categories,subcategories,and actions based on sentences,with 82.99%accuracy for category,76.19%accuracy for subcategory,and 90.28%accuracy for action.The study concludes that using machine learning techniques is effective for recognizing and classifying human actions in the smart home,supporting its feasibility in various scenarios and opening new possibilities for advanced natural language processing systems in the field of AI and smart homes.
文摘The BlazePose,which models human body skeletons as spatiotem-poral graphs,has achieved fantastic performance in skeleton-based action identification.Skeleton extraction from photos for mobile devices has been made possible by the BlazePose system.A Spatial-Temporal Graph Con-volutional Network(STGCN)can then forecast the actions.The Spatial-Temporal Graph Convolutional Network(STGCN)can be improved by simply replacing the skeleton input data with a different set of joints that provide more information about the activity of interest.On the other hand,existing approaches require the user to manually set the graph’s topology and then fix it across all input layers and samples.This research shows how to use the Statistical Fractal Search(SFS)-Guided whale optimization algorithm(GWOA).To get the best solution for the GWOA,we adopt the SFS diffusion algorithm,which uses the random walk with a Gaussian distribution method common to growing systems.Continuous values are transformed into binary to apply to the feature-selection problem in conjunction with the BlazePose skeletal topology and stochastic fractal search to construct a novel implementation of the BlazePose topology for action recognition.In our experiments,we employed the Kinetics and the NTU-RGB+D datasets.The achieved actiona accuracy in the X-View is 93.14%and in the X-Sub is 96.74%.In addition,the proposed model performs better in numerous statistical tests such as the Analysis of Variance(ANOVA),Wilcoxon signed-rank test,histogram,and times analysis.
文摘Sign language fills the communication gap for people with hearing and speaking ailments.It includes both visual modalities,manual gestures consisting of movements of hands,and non-manual gestures incorporating body movements including head,facial expressions,eyes,shoulder shrugging,etc.Previously both gestures have been detected;identifying separately may have better accuracy,butmuch communicational information is lost.Aproper sign language mechanism is needed to detect manual and non-manual gestures to convey the appropriate detailed message to others.Our novel proposed system contributes as Sign LanguageAction Transformer Network(SLATN),localizing hand,body,and facial gestures in video sequences.Here we are expending a Transformer-style structural design as a“base network”to extract features from a spatiotemporal domain.Themodel impulsively learns to track individual persons and their action context inmultiple frames.Furthermore,a“head network”emphasizes hand movement and facial expression simultaneously,which is often crucial to understanding sign language,using its attention mechanism for creating tight bounding boxes around classified gestures.The model’s work is later compared with the traditional identification methods of activity recognition.It not only works faster but achieves better accuracy as well.Themodel achieves overall 82.66%testing accuracy with a very considerable performance of computation with 94.13 Giga-Floating Point Operations per Second(G-FLOPS).Another contribution is a newly created dataset of Pakistan Sign Language forManual and Non-Manual(PkSLMNM)gestures.
文摘The ever-growing available visual data(i.e.,uploaded videos and pictures by internet users)has attracted the research community’s attention in the computer vision field.Therefore,finding efficient solutions to extract knowledge from these sources is imperative.Recently,the BlazePose system has been released for skeleton extraction from images oriented to mobile devices.With this skeleton graph representation in place,a Spatial-Temporal Graph Convolutional Network can be implemented to predict the action.We hypothesize that just by changing the skeleton input data for a different set of joints that offers more information about the action of interest,it is possible to increase the performance of the Spatial-Temporal Graph Convolutional Network for HAR tasks.Hence,in this study,we present the first implementation of the BlazePose skeleton topology upon this architecture for action recognition.Moreover,we propose the Enhanced-BlazePose topology that can achieve better results than its predecessor.Additionally,we propose different skeleton detection thresholds that can improve the accuracy performance even further.We reached a top-1 accuracy performance of 40.1%on the Kinetics dataset.For the NTU-RGB+D dataset,we achieved 87.59%and 92.1%accuracy for Cross-Subject and Cross-View evaluation criteria,respectively.
基金This research work is supported in part by Chiang Mai University and HITEC University.
文摘Human action recognition(HAR)based on Artificial intelligence reasoning is the most important research area in computer vision.Big breakthroughs in this field have been observed in the last few years;additionally,the interest in research in this field is evolving,such as understanding of actions and scenes,studying human joints,and human posture recognition.Many HAR techniques are introduced in the literature.Nonetheless,the challenge of redundant and irrelevant features reduces recognition accuracy.They also faced a few other challenges,such as differing perspectives,environmental conditions,and temporal variations,among others.In this work,a deep learning and improved whale optimization algorithm based framework is proposed for HAR.The proposed framework consists of a few core stages i.e.,frames initial preprocessing,fine-tuned pre-trained deep learning models through transfer learning(TL),features fusion using modified serial based approach,and improved whale optimization based best features selection for final classification.Two pre-trained deep learning models such as InceptionV3 and Resnet101 are fine-tuned and TL is employed to train on action recognition datasets.The fusion process increases the length of feature vectors;therefore,improved whale optimization algorithm is proposed and selects the best features.The best selected features are finally classified usingmachine learning(ML)classifiers.Four publicly accessible datasets such as Ut-interaction,Hollywood,Free Viewpoint Action Recognition usingMotion History Volumes(IXMAS),and centre of computer vision(UCF)Sports,are employed and achieved the testing accuracy of 100%,99.9%,99.1%,and 100%respectively.Comparison with state of the art techniques(SOTA),the proposed method showed the improved accuracy.
基金supported by the General Program of the National Natural Science Foundation of China (62272234)the Enterprise Cooperation Project (2022h160)the Priority Academic Program Development of Jiangsu Higher Education Institutions Project.
文摘An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed as a solution to the issues of single spatiotemporal feature scale extraction,information redundancy,and insufficient extraction of frequency domain information in channels in 3D convolutional neural networks.Firstly,based on 3D CNN,this paper designs a new multilevel spatiotemporal feature fusion(MSF)structure,which is embedded in the network model,mainly through multilevel spatiotemporal feature separation,splicing and fusion,to achieve the fusion of spatial perceptual fields and short-medium-long time series information at different scales with reduced network parameters;In the second step,a multi-frequency channel and spatiotemporal attention module(FSAM)is introduced to assign different frequency features and spatiotemporal features in the channels are assigned corresponding weights to reduce the information redundancy of the feature maps.Finally,we embed the proposed method into the R3D model,which replaced the 2D convolutional filters in the 2D Resnet with 3D convolutional filters and conduct extensive experimental validation on the small and medium-sized dataset UCF101 and the largesized dataset Kinetics-400.The findings revealed that our model increased the recognition accuracy on both datasets.Results on the UCF101 dataset,in particular,demonstrate that our model outperforms R3D in terms of a maximum recognition accuracy improvement of 7.2%while using 34.2%fewer parameters.The MSF and FSAM are migrated to another traditional 3D action recognition model named C3D for application testing.The test results based on UCF101 show that the recognition accuracy is improved by 8.9%,proving the strong generalization ability and universality of the method in this paper.
基金supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2020R1A6A1A03040583)supported by Kyonggi University’s Graduate Research Assistantship 2023.
文摘Artificial intelligence is increasingly being applied in the field of video analysis,particularly in the area of public safety where video surveillance equipment such as closed-circuit television(CCTV)is used and automated analysis of video information is required.However,various issues such as data size limitations and low processing speeds make real-time extraction of video data challenging.Video analysis technology applies object classification,detection,and relationship analysis to continuous 2D frame data,and the various meanings within the video are thus analyzed based on the extracted basic data.Motion recognition is key in this analysis.Motion recognition is a challenging field that analyzes human body movements,requiring the interpretation of complex movements of human joints and the relationships between various objects.The deep learning-based human skeleton detection algorithm is a representative motion recognition algorithm.Recently,motion analysis models such as the SlowFast network algorithm,have also been developed with excellent performance.However,these models do not operate properly in most wide-angle video environments outdoors,displaying low response speed,as expected from motion classification extraction in environments associated with high-resolution images.The proposed method achieves high level of extraction and accuracy by improving SlowFast’s input data preprocessing and data structure methods.The input data are preprocessed through object tracking and background removal using YOLO and DeepSORT.A higher performance than that of a single model is achieved by improving the existing SlowFast’s data structure into a frame unit structure.Based on the confusion matrix,accuracies of 70.16%and 70.74%were obtained for the existing SlowFast and proposed model,respectively,indicating a 0.58%increase in accuracy.Comparing detection,based on behavioral classification,the existing SlowFast detected 2,341,164 cases,whereas the proposed model detected 3,119,323 cases,which is an increase of 33.23%.
基金The National Natural Science Foundation of China(No.60971098,61201345)
文摘To improve the recognition performance of video human actions,an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement.First,the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features,then the mean-shift clustering algorithm with the adaptive scale kernel is used to label these features.After pooling the features with the same label to generate part-based representation,the visual word responses within one large scale volume are collected as video object representation.On the benchmark KTH(Kungliga Tekniska H?gskolan)and UCF (University of Central Florida)-sports action datasets,the experimental results show that the proposed method enhances the representative and discriminative power of action features, and improves recognition rates.Compared with other related literature,the proposed method obtains superior performance.
基金supported by the National Key R&D Program of China (2021ZD0202805,2019YFA0709504,2021ZD0200900)National Defense Science and Technology Innovation Special Zone Spark Project (20-163-00-TS-009-152-01)+4 种基金National Natural Science Foundation of China (31900719,U20A20227,82125008)Innovative Research Team of High-level Local Universities in Shanghai,Science and Technology Committee Rising-Star Program (19QA1401400)111 Project (B18015)Shanghai Municipal Science and Technology Major Project (2018SHZDZX01)Shanghai Center for Brain Science and Brain-Inspired Technology。
文摘Video-based action recognition is becoming a vital tool in clinical research and neuroscientific study for disorder detection and prediction.However,action recognition currently used in non-human primate(NHP)research relies heavily on intense manual labor and lacks standardized assessment.In this work,we established two standard benchmark datasets of NHPs in the laboratory:Monkeyin Lab(Mi L),which includes 13 categories of actions and postures,and MiL2D,which includes sequences of two-dimensional(2D)skeleton features.Furthermore,based on recent methodological advances in deep learning and skeleton visualization,we introduced the Monkey Monitor Kit(Mon Kit)toolbox for automatic action recognition,posture estimation,and identification of fine motor activity in monkeys.Using the datasets and Mon Kit,we evaluated the daily behaviors of wild-type cynomolgus monkeys within their home cages and experimental environments and compared these observations with the behaviors exhibited by cynomolgus monkeys possessing mutations in the MECP2 gene as a disease model of Rett syndrome(RTT).Mon Kit was used to assess motor function,stereotyped behaviors,and depressive phenotypes,with the outcomes compared with human manual detection.Mon Kit established consistent criteria for identifying behavior in NHPs with high accuracy and efficiency,thus providing a novel and comprehensive tool for assessing phenotypic behavior in monkeys.
基金supported by the National Natural Science Foundation of China under Grant No. 61503424the Research Project by The State Ethnic Affairs Commission under Grant No. 14ZYZ017+2 种基金the Jiangsu Future Networks Innovation Institute-Prospective Research Project on Future Networks under Grant No. BY2013095-2-14the Fundamental Research Funds for the Central Universities No. FRF-TP-14-046A2the first-class discipline construction transitional funds of Minzu University of China
文摘In order to take advantage of the logical structure of video sequences and improve the recognition accuracy of the human action, a novel hybrid human action detection method based on three descriptors and decision level fusion is proposed. Firstly, the minimal 3D space region of human action region is detected by combining frame difference method and Vi BE algorithm, and the three-dimensional histogram of oriented gradient(HOG3D) is extracted. At the same time, the characteristics of global descriptors based on frequency domain filtering(FDF) and the local descriptors based on spatial-temporal interest points(STIP) are extracted. Principal component analysis(PCA) is implemented to reduce the dimension of the gradient histogram and the global descriptor, and bag of words(BoW) model is applied to describe the local descriptors based on STIP. Finally, a linear support vector machine(SVM) is used to create a new decision level fusion classifier. Some experiments are done to verify the performance of the multi-features, and the results show that they have good representation ability and generalization ability. Otherwise, the proposed scheme obtains very competitive results on the well-known datasets in terms of mean average precision.