Human pose estimation aims to localize the body joints from image or video data. With the development of deep learning, pose estimation has become a hot research topic in the field of computer vision. In recent years, human pose estimation has achieved great success in multiple fields such as animation and sports. However, to obtain accurate positioning results, existing methods may suffer from large model sizes, a high number of parameters, and increased complexity, leading to high computing costs. In this paper, we propose a new lightweight feature encoder to construct a high-resolution network that reduces the number of parameters and lowers the computing cost. We also introduce a semantic enhancement module that improves global feature extraction and network performance by combining channel and spatial dimensions. Furthermore, we propose a densely connected spatial pyramid pooling module to compensate for the decrease in image resolution and information loss in the network. As a result, our method effectively reduces the number of parameters and complexity while ensuring high performance. Extensive experiments show that our method achieves competitive performance while dramatically reducing the number of parameters and operational complexity. Specifically, our method obtains an 89.9% AP score on MPII VAL, while the number of parameters and the complexity of operations are reduced by 41% and 36%, respectively.
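The spatial pyramid pooling idea referenced in this abstract can be sketched as follows. This is a minimal illustrative helper, not the paper's densely connected module: a feature map is average-pooled into fixed n x n grids at several scales, and the pooled values are concatenated so coarse global context and finer local context are summarized together.

```python
# Illustrative sketch of spatial pyramid pooling (a simplification; the
# paper's dense connections between pyramid levels are omitted here).
def spatial_pyramid_pool(fmap, levels=(1, 2)):
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # bin boundaries for cell (i, j) of the n x n grid
                r0, r1 = i * h // n, (i + 1) * h // n
                c0, c1 = j * w // n, (j + 1) * w // n
                cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
                pooled.append(sum(cells) / len(cells))
    return pooled
```

With levels (1, 2) the output always has 1 + 4 = 5 values, regardless of the input resolution, which is what makes pyramid pooling useful after resolution-reducing stages.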
Human pose estimation is a basic and critical task in the field of computer vision that involves determining the position (or spatial coordinates) of the joints of the human body in a given image or video. It is widely used in motion analysis, medical evaluation, and behavior monitoring. In this paper, the authors propose a method for multi-view human pose estimation. Two image sensors were placed orthogonally with respect to each other to capture the pose of the subject as they moved; this yielded accurate and comprehensive three-dimensional (3D) motion reconstructions that captured their multi-directional poses. Following this, we propose a method based on 3D pose estimation to assess the similarity of the motion features of patients with motor dysfunction by comparing differences between their range of motion and that of normal subjects. We converted these differences into Fugl–Meyer assessment (FMA) scores in order to quantify them. Finally, we implemented the proposed method in the Unity framework and built a virtual reality platform that provides users with human–computer interaction to make the task more enjoyable for them and ensure their active participation in the assessment process. The goal is to provide a suitable means of assessing movement disorders without requiring the immediate supervision of a physician.
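The benefit of two orthogonally placed cameras can be illustrated with an idealized sketch. Under an orthographic assumption (a hypothetical simplification, not the paper's calibration pipeline), the front camera observes a joint's (x, y) image coordinates and the side camera its (z, y) coordinates; the shared vertical axis lets the two 2D detections be fused into one 3D point.

```python
# Idealized orthographic fusion of two orthogonal views (illustrative only;
# a real system would use calibrated perspective cameras and triangulation).
def fuse_orthogonal_views(front_xy, side_zy):
    x, y_front = front_xy
    z, y_side = side_zy
    # the vertical coordinate is seen by both cameras; average both estimates
    return (x, (y_front + y_side) / 2.0, z)
```

This also shows why a single view is insufficient: the front camera alone carries no information about z, the depth that the side camera supplies.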
3D human pose estimation is a major focus area in the field of computer vision and plays an important role in practical applications. This article summarizes the frameworks and research progress related to estimation from monocular RGB images and videos. An overall perspective of methods integrated with deep learning is introduced, and a novel analysis framework based on image-based and video-based inputs is proposed. From this viewpoint, common problems are discussed. The diversity of human postures usually leads to problems such as occlusion and ambiguity, and the lack of training datasets often results in poor generalization ability of the model. Regression methods are crucial for solving such problems. For image-based input, the multi-view method is commonly used to solve occlusion problems and is analyzed comprehensively here. For video-based input, prior knowledge of restricted human motion is used to predict human postures. In addition, structural constraints are widely used as prior knowledge. Furthermore, weakly supervised learning methods are studied and discussed for these two types of inputs to improve model generalization ability. The problem of insufficient training datasets must also be considered, especially because 3D datasets are usually biased and limited. Finally, emerging and popular datasets and evaluation indicators are discussed; the characteristics of the datasets and the relationships among the indicators are explained and highlighted. Thus, this article can be useful and instructive for researchers who lack experience and find this field confusing. By providing an overview of 3D human pose estimation, it sorts and refines recent studies, describes core problems and common useful methods, and discusses the scope for further research.
Human pose estimation (HPE) is a procedure for determining the structure of the body pose, and it is considered a challenging issue in the computer vision (CV) community. HPE finds applications in several fields, namely activity recognition and human-computer interfaces. Despite its benefits, HPE remains challenging due to variations in visual appearance, lighting, occlusion, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. Primarily, the video frame conversion process is performed, and pre-processing takes place via a bilateral filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. Besides, the hyperparameters of the EfficientNet model are tuned using the squirrel search algorithm (SSA). In the final stage, the multiclass support vector machine (M-SVM) technique is utilized for the identification and classification of human poses. The design of bilateral filtering followed by the SSA-based EfficientNet model for HPE depicts the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations were executed. The experimental results report the betterment of the SSDCNN-HPE system over recent existing techniques in terms of different measures.
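The bilateral-filtering pre-processing step can be sketched in one dimension. Each sample is replaced by a weighted average of its neighbours, where the weight decays with both spatial distance and intensity difference, so noise is smoothed while sharp edges are preserved. This is an illustrative stdlib version; a real pipeline would apply the 2D variant to each video frame via an image-processing library.

```python
import math

# 1D sketch of bilateral filtering: spatial closeness and intensity
# similarity both contribute to each neighbour's weight.
def bilateral_1d(signal, radius=2, sigma_s=1.0, sigma_r=10.0):
    out = []
    for i, center in enumerate(signal):
        num = den = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2)
                         - ((center - signal[j]) ** 2) / (2 * sigma_r ** 2))
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out
```

On a flat signal the filter is an identity up to rounding, and with a small range sigma a step edge passes through almost untouched, which is the property that distinguishes it from a plain Gaussian blur.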
In the new era of technology, daily human activities are becoming more challenging to monitor in complex scenes and backgrounds. To understand scenes and activities from human life logs, human-object interaction (HOI) is important for visual relationship detection and human pose estimation. Activity understanding and interaction recognition between humans and objects, along with pose estimation and interaction modeling, are explained. Some existing algorithms and feature extraction procedures are complicated, struggle with accurate detection of rare human postures and occluded regions, and suffer from unsatisfactory detection of objects, especially small-sized objects. Existing HOI detection techniques are instance-centric (object-based), where interactions are predicted between all pairs; such estimation depends on appearance features and spatial information. Therefore, we propose a novel approach to demonstrate that appearance features alone are not sufficient to predict HOI. Furthermore, we detect human body parts by using a Gaussian Mixture Model (GMM), followed by object detection using YOLO. We predict interaction points, which directly classify the interaction, and pair them with densely predicted HOI vectors by using the interaction algorithm. The interactions are linked with the human and the object to predict the actions. Experiments on two benchmark HOI datasets demonstrate the proposed approach.
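One hedged reading of the interaction-point pairing step is sketched below. It is a simplification, not the paper's densely predicted HOI vectors: an interaction is localized as a point, and the (human, object) pair whose centre midpoint lies closest to that point is selected as the interacting pair.

```python
# Illustrative pairing by interaction point (simplified assumption):
# the interacting pair is the one whose centre midpoint best matches
# the predicted interaction point.
def pair_by_interaction_point(interaction_pt, human_centers, object_centers):
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    best, best_d = None, float("inf")
    for hi, h in enumerate(human_centers):
        for oi, o in enumerate(object_centers):
            mid = ((h[0] + o[0]) / 2.0, (h[1] + o[1]) / 2.0)
            d = dist2(mid, interaction_pt)
            if d < best_d:
                best, best_d = (hi, oi), d
    return best
```

The point-based formulation avoids scoring every pair purely on appearance, which is the shortcoming the abstract argues against.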
Recovering human pose from RGB images and videos has drawn increasing attention in recent years owing to minimal sensor requirements and applicability in diverse fields such as human-computer interaction, robotics, video analytics, and augmented reality. Although a large amount of work has been devoted to this field, 3D human pose estimation based on monocular images or videos remains a very challenging task due to a variety of difficulties such as depth ambiguities, occlusion, background clutter, and lack of training data. In this survey, we summarize recent advances in monocular 3D human pose estimation. We provide a general taxonomy to cover existing approaches and analyze their capabilities and limitations. We also present a summary of extensively used datasets and metrics, and provide a quantitative comparison of some representative methods. Finally, we conclude with a discussion of realistic challenges and open problems for future research directions.
In this article, a comprehensive survey of deep learning-based (DL-based) human pose estimation (HPE) that can help researchers in the domain of computer vision is presented. HPE is among the fastest-growing research domains of computer vision and is used to solve several problems for human endeavours. After a detailed introduction, three different human body models are presented, followed by the main stages of HPE and two pipelines of two-dimensional (2D) HPE. The details of the four components of HPE are also presented. The keypoint output formats of two popular 2D HPE datasets and the most cited DL-based HPE articles since the year of the breakthrough are both shown in tabular form. This study intends to highlight the limitations of published reviews and surveys with respect to presenting a systematic review of current DL-based solutions to the 2D HPE model. Furthermore, a detailed and meaningful survey that will guide new and existing researchers on DL-based 2D HPE models is achieved. Finally, some future research directions in the field of HPE, such as limited data on disabled persons and multi-training of DL-based models, are revealed to encourage researchers and promote the growth of HPE research.
Deep neural networks are vulnerable to attacks from adversarial inputs, but corresponding attack research on human pose estimation (HPE), particularly for body joint detection, has been largely unexplored. Transferring classification-based attack methods to body joint regression tasks is not straightforward; another issue is that attack effectiveness and imperceptibility contradict each other. To solve these issues, we propose local imperceptible attacks on HPE networks. In particular, we reformulate imperceptible attacks on body joint regression as a constrained maximum-allowable attack. Furthermore, we approximate the solution using iterative gradient-based strength refinement and greedy pixel selection. Our method crafts effective perceptual adversarial attacks that consider both human perception and attack effectiveness. We conducted a series of imperceptible attacks against state-of-the-art HPE methods, including HigherHRNet, DEKR, and ViTPose. The experimental results demonstrate that the proposed method achieves excellent imperceptibility while maintaining attack effectiveness by significantly reducing the number of perturbed pixels: perturbing approximately 4% of the pixels is sufficient to attack HPE.
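The greedy pixel-selection idea can be sketched as follows. This is an assumed simplification of the paper's method: pixels are ranked by gradient magnitude and only the top-k are perturbed, each by a fixed strength in the direction that increases the loss, keeping the perturbation sparse and hence less perceptible.

```python
# Sketch of greedy sparse perturbation: perturb only the k pixels whose
# loss gradient is largest in magnitude (eps and k are illustrative).
def greedy_sparse_perturbation(grads, eps=0.1, k=2):
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    delta = [0.0] * len(grads)
    for i in ranked[:k]:
        delta[i] = eps if grads[i] > 0 else -eps
    return delta
```

In a full attack this selection would be interleaved with the iterative gradient-based strength refinement the abstract mentions; here only the sparsity mechanism is shown.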
Human pose estimation is a critical research area in the field of computer vision, playing a significant role in applications such as human-computer interaction, behavior analysis, and action recognition. In this paper, we propose a U-shaped keypoint detection network (DAUNet) based on an improved ResNet subsampling structure and a spatial grouping mechanism. This network addresses key challenges in traditional methods, such as information loss, large network redundancy, and insufficient sensitivity to low-resolution features. DAUNet is composed of three main components. First, we introduce an improved BottleNeck block that employs partial convolution and strip pooling to reduce computational load and mitigate feature loss. Second, after upsampling, the network eliminates redundant features, improving the overall efficiency. Finally, a lightweight spatial grouping attention mechanism is applied to enhance low-resolution semantic features within the feature map, allowing for better restoration of the original image size and higher accuracy. Experimental results demonstrate that DAUNet achieves superior accuracy compared to most existing keypoint detection models, with a mean PCKh@0.5 score of 91.6% on the MPII dataset and an AP of 76.1% on the COCO dataset. Moreover, real-world experiments further validate the robustness and generalizability of DAUNet for detecting human bodies in unknown environments, highlighting its potential for broader applications.
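The strip pooling used inside the improved BottleNeck block can be illustrated minimally. This sketch omits how real strip pooling mixes the pooled strips back into the feature map: a feature map is averaged along entire rows and entire columns, capturing long-range context in one direction at a time at very low cost.

```python
# Minimal strip pooling sketch: average over full-width rows and
# full-height columns (the re-fusion step of real strip pooling is omitted).
def strip_pool(fmap):
    h, w = len(fmap), len(fmap[0])
    row_strips = [sum(row) / w for row in fmap]                              # h values
    col_strips = [sum(fmap[r][c] for r in range(h)) / h for c in range(w)]   # w values
    return row_strips, col_strips
```

Compared with square pooling windows, the elongated strips reach across the whole image in one axis, which is why they help with elongated structures such as limbs.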
Previous multi-view 3D human pose estimation methods neither correlate different human joints in each view nor explicitly model learnable correlations between the same joints in different views, meaning that skeleton structure information is not utilized and multi-view pose information is not completely fused. Moreover, existing graph convolutional operations do not consider the specificity of different joints and different views of pose information when processing skeleton graphs, so the correlation weights between nodes in the graph and their neighborhood nodes are shared. Existing Graph Convolutional Networks (GCNs) cannot efficiently extract global and deep-level skeleton structure information and view correlations. To solve these problems, pre-estimated multi-view 2D poses are organized as a multi-view skeleton graph to explicitly fuse skeleton priors and view correlations and to handle the occlusion problem, with skeleton-edges and symmetry-edges representing the structural correlations between adjacent joints in each view of the skeleton graph and view-edges representing the view correlations between the same joints in different views. To make the graph convolution operation mine elaborate and sufficient skeleton structure information and view correlations, different correlation weights are assigned to different categories of neighborhood nodes and further to each node in the graph. Based on this graph convolution operation, a Residual Graph Convolution (RGC) module is designed as the basic module and combined with a simplified Hourglass architecture to construct Hourglass-GCN as our 3D pose estimation network. Hourglass-GCN, with a symmetrical and concise architecture, processes three scales of multi-view skeleton graphs to efficiently extract local-to-global scale and shallow-to-deep level skeleton features. Experimental results on the common large-scale 3D pose datasets Human3.6M and MPI-INF-3DHP show that Hourglass-GCN outperforms several excellent methods in 3D pose estimation accuracy.
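The edge-type-specific weighting described above can be sketched with scalar weights on a toy graph. The node features, edge list, and weight values below are all illustrative assumptions, not the paper's learned parameters: each neighbour contributes through a weight chosen by the category of the connecting edge (skeleton, symmetry, or view edge), instead of one shared weight for all neighbours.

```python
# Toy graph convolution with edge-type-specific scalar weights
# (real GCNs use learned weight matrices per edge category).
def typed_graph_conv(feats, edges, w_self=1.0, w_type=None):
    w_type = w_type or {"skeleton": 0.5, "symmetry": 0.25, "view": 0.25}
    out = [w_self * f for f in feats]
    for i, j, etype in edges:   # undirected edge between nodes i and j
        out[i] += w_type[etype] * feats[j]
        out[j] += w_type[etype] * feats[i]
    return out
```

Separating the weights by edge category is what lets the network treat structural skeleton links, left-right symmetry links, and cross-view links differently.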
Due to factors such as motion blur, video defocus, and occlusion, multi-frame human pose estimation is a challenging task. Exploiting temporal consistency between consecutive frames is an efficient approach for addressing this issue. Currently, most methods explore temporal consistency through refinements of the final heatmaps. The heatmaps contain the semantic information of keypoints and can improve detection quality to a certain extent. However, they are generated from features, and feature-level refinements are rarely considered. In this paper, we propose a human pose estimation framework with refinements at both the feature and semantics levels. We align auxiliary features with the features of the current frame to reduce the loss caused by different feature distributions. An attention mechanism is then used to fuse the auxiliary features with the current features. At the semantics level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
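One hedged interpretation of the semantics-level refinement is sketched below; it is not the paper's exact formulation. The difference between the current heatmap and its temporal neighbours serves as an auxiliary signal that nudges the current heatmap toward temporally consistent responses, with a fusion weight alpha that is a hypothetical parameter, not a value from the paper.

```python
# Illustrative temporal heatmap refinement: pull each current response
# toward the mean of the neighbouring frames' responses by a factor alpha.
def refine_with_difference(prev_hm, cur_hm, next_hm, alpha=0.5):
    return [c + alpha * ((p + n) / 2.0 - c)
            for p, c, n in zip(prev_hm, cur_hm, next_hm)]
```

A dropped detection in the current frame (a zero where both neighbours respond) is partially recovered, which is the kind of blur/occlusion failure the abstract targets.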
Recently, stacked hourglass networks have shown outstanding performance in human pose estimation. However, repeated bottom-up and top-down strided convolution operations in deep convolutional neural networks lead to a significant decrease in the initial image resolution. To address this problem, we propose to incorporate an affinage module and a residual attention module into the stacked hourglass network for human pose estimation. This paper introduces a novel architecture to replace the stacked hourglass network's up-sampling operation for obtaining high-resolution features. We refer to this architecture as the affinage module, which is critical to improving the performance of the stacked hourglass network. Additionally, we propose a novel residual attention module to increase the supervision of the up-sampling process. The effectiveness of the introduced modules is evaluated on standard benchmarks. Various experimental results demonstrate that our method achieves more accurate and more robust human pose estimation in images with complex backgrounds.
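For context, the baseline up-sampling that such a module replaces can be sketched as plain nearest-neighbour interpolation (the affinage module itself is a learned replacement; this shows only the naive operation it improves upon): each feature value is simply repeated s times along both axes.

```python
# Nearest-neighbour up-sampling baseline: repeat each value s times
# horizontally and each row s times vertically.
def upsample_nn(fmap, s=2):
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(s)]
        out.extend(list(wide) for _ in range(s))
    return out
```

Because nearest-neighbour repetition adds no new information, a learned up-sampling path can recover sharper high-resolution features, which motivates replacing it.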
This paper introduces a novel framework, RFPose-OT, to enable three-dimensional (3D) human pose estimation from radio frequency (RF) signals. Different from existing methods that predict human poses from RF signals directly at the signal level, we consider the structural difference between RF signals and human poses, propose a transformation of the RF signals to the pose domain at the feature level based on optimal transport (OT) theory, and generate human poses from the transformed features. To evaluate RFPose-OT, we built a radio system and a multi-view camera system to acquire RF signal data and ground-truth human poses. Experimental results in a basic indoor environment, an occluded indoor environment, and an outdoor environment demonstrate that RFPose-OT can predict 3D human poses with higher precision than state-of-the-art methods.
In current interactive television schemes, the viewpoint must be manipulated by the user. However, there is no efficient method to assist a user in automatically identifying and tracking the optimum viewpoint when observing an object of interest, because many objects, most often humans, move rapidly and frequently. This paper proposes a novel framework for determining and tracking the virtual camera that best captures the front of the person of interest (PoI). First, one PoI is interactively chosen in a segmented 3D scene reconstructed by the space carving method. Second, key points of the PoI's torso are detected using a model-based method, and the person's global motion, including rotation and translation, is estimated using a closed-form method with three corresponding points. In the last step, the front direction of the PoI is tracked temporally using the unscented particle filter (UPF). Experimental results show that the method can properly compute the front direction of the PoI and robustly track the best viewpoints.
Background: In computer vision, simultaneously estimating human pose, shape, and clothing is a practical issue in real life, but remains a challenging task owing to the variety of clothing, the complexity of deformation, the shortage of large-scale datasets, and the difficulty of estimating clothing style. Methods: We propose a multistage weakly supervised method that makes full use of data with less labeled information to learn to estimate human body shape, pose, and clothing deformation. In the first stage, the SMPL human-body model parameters are regressed using multi-view 2D key points of the human body. Using multi-view information as weak supervision avoids the depth ambiguity of a single view, yields a more accurate human posture, and makes supervisory information easy to access. In the second stage, clothing is represented by a PCA-based model that uses two-dimensional key points of clothing as supervision to regress the parameters. In the third stage, we predefine an embedding graph for each type of clothing to describe the deformation; the mask information of the clothing is then used to further adjust the deformation. To facilitate training, we constructed a multi-view synthetic dataset that includes BCNet and SURREAL. Results: Experiments show that the accuracy of our method reaches the same level as that of SOTA methods using strong supervision, while only using weakly supervised information. Because this study uses only weakly supervised information, which is much easier to obtain, it has the advantage of utilizing existing data as training data. Experiments on the DeepFashion2 dataset show that our method can make full use of the existing weak supervision for fine-tuning on a dataset with little supervision information, in contrast to strong supervision, which cannot be trained or adjusted owing to the lack of exact annotations. Conclusions: Our weakly supervised method can accurately estimate human body size, pose, and several common types of clothing, and overcomes the current shortage of clothing data.
Scale variation is a major challenge in multi-person pose estimation. In scenes where persons are present at various distances, models tend to perform better on larger-scale persons, while performance for smaller-scale persons often falls short of expectations. Effectively balancing persons of different scales therefore poses a significant challenge. This paper proposes a new multi-person pose estimation model called FSANet to improve performance in complex scenes. Our model utilizes the High-Resolution Network (HRNet) as the backbone and feeds the outputs of the last stage's four branches into the dilated convolution-based (DCB) module. The DCB module employs a parallel structure that incorporates dilated convolutions with different rates to expand the receptive field of each branch. Subsequently, the attention operation-based (AOB) module performs attention operations at both the branch and channel levels to enhance high-frequency features and reduce the influence of noise. Finally, predictions are made using the heatmap representation. The model can recognize images with diverse scales and more complex semantic information. Experimental results demonstrate that FSANet achieves competitive results on the MSCOCO and MPII datasets, validating the effectiveness of the proposed approach.
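The receptive-field effect of the parallel dilated convolutions in the DCB module can be shown in one dimension. This is an illustrative single-channel sketch with hypothetical weights, not the module itself: the same kernel applied with different dilation rates covers differently sized receptive fields, and the parallel branches can then be combined.

```python
# 1D dilated convolution (valid positions only): the kernel taps are
# spaced d samples apart, so larger d sees a wider input span.
def dilated_conv1d(x, w, d):
    span = (len(w) - 1) * d
    return [sum(w[k] * x[i + k * d] for k in range(len(w)))
            for i in range(len(x) - span)]
```

With a 3-tap kernel, dilation 1 spans 3 input samples while dilation 2 spans 5, which is exactly how the parallel branches cover persons at different scales without extra parameters.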
Human Interaction Recognition (HIR) is one of the challenging issues in computer vision research due to the involvement of multiple individuals and their mutual interactions within video frames generated from their movements. HIR requires more sophisticated analysis than Human Action Recognition (HAR), since HAR focuses solely on individual activities like walking or running, while HIR involves the interactions between people. This research aims to develop a robust system for recognizing five common human interactions, namely hugging, kicking, pushing, pointing, and no interaction, from video sequences using multiple cameras. In this study, a hybrid Deep Learning (DL) and Machine Learning (ML) model was employed to improve classification accuracy and generalizability. The dataset was collected in an indoor environment with four-channel cameras capturing the five types of interactions among 13 participants. The data was processed using a DL model with a fine-tuned ResNet (Residual Networks) architecture based on 2D Convolutional Neural Network (CNN) layers for feature extraction. Subsequently, machine learning models were trained for interaction classification using six commonly used ML algorithms, including SVM, KNN, RF, DT, NB, and XGBoost. The results demonstrate a high accuracy of 95.45% in classifying human interactions. The hybrid approach enabled effective learning, resulting in highly accurate performance across different interaction types. Future work will explore more complex scenarios involving multiple individuals based on this architecture.
Inpatient falls from beds in hospitals are a common problem, and such falls may result in severe injuries. This problem can be addressed by continuous monitoring of patients using cameras. Recent advancements in deep learning-based video analytics have made fall detection more effective and efficient. Along with fall detection, monitoring the patients' other activities is also of significant concern for assessing the improvement in their health. Precisely monitoring every action of a patient requires computation-intensive models, which limits the applicability of such networks. Hence, to keep the model lightweight, already designed fall detection networks can be extended to monitor the general activities of patients along with fall detection. Motivated by this notion, we propose a novel, lightweight, and efficient patient activity monitoring system that broadly classifies patients' activities into fall, activity, and rest classes based on their poses. The whole network comprises three sub-networks: a Convolutional Neural Network (CNN)-based video compression network, a Lightweight Pose Network (LPN), and a Residual Network (ResNet) Mixer block-based activity recognition network. The compression network compresses the video streams using deep learning networks for efficient storage and retrieval; the LPN then estimates human poses. Finally, the activity recognition network classifies the patients' activities based on their poses. The proposed system shows an overall accuracy of approximately 99.7% on a standard dataset, with 99.63% fall detection accuracy, and efficiently monitors different events, which may help monitor falls and improve inpatients' health.
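A pose-based three-way split into fall, activity, and rest can be illustrated with a heuristic; this is not the paper's trained classifier, and the thresholds are illustrative assumptions. The aspect ratio of the keypoint bounding box separates upright poses from lying ones, and a motion score separates rest from activity.

```python
# Heuristic sketch of pose-based coarse activity classes (illustrative
# thresholds; a real system learns this from data).
def classify_pose(keypoints, motion=0.0, lying_ratio=1.5, motion_thr=0.2):
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    if width > lying_ratio * height:   # pose is much wider than tall
        return "fall"
    return "activity" if motion > motion_thr else "rest"
```

The point of the sketch is that once poses are available from a lightweight estimator, the downstream classification can be very cheap, which matches the system's lightweight design goal.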
Human action recognition based on skeleton information has been used extensively in various areas, such as human-computer interaction. In this paper, we extracted human skeleton data by constructing a two-stage human pose estimation model that combines an improved single shot detector (SSD) algorithm with convolutional pose machines (CPM) to obtain human skeleton heatmaps. The backbone of the SSD algorithm was replaced with ResNet, which characterizes images effectively. In addition, we designed multiscale transformation rules for CPM to fuse information at different scales, and a convolutional neural network for classifying the skeleton keypoint heatmaps to complete action recognition. Indoor and outdoor experiments were conducted on the Caster Moma mobile robot platform; without an external remote control, the real-time movement of the robot was controlled by the leader through command actions.
Computer vision, a scientific discipline that enables machines to perceive visual information, aims to supplant human eyes in tasks encompassing object recognition, localization, and tracking. In traditional educational settings, instructors or evaluators assess teaching performance based on subjective judgment. However, with continuous advancements in computer vision technology, it becomes increasingly crucial for computers to take on the role of judges in obtaining vital information and making unbiased evaluations. Against this backdrop, this paper proposes a deep learning-based approach for evaluating lecture posture. First, feature information is extracted from various dimensions, including head position, hand gestures, and body posture, using a human pose estimation algorithm. Second, a machine learning-based regression model is employed to predict machine scores by comparing the extracted features with expert-assigned human scores. The correlation between machine scores and human scores is investigated through experiments and analysis, revealing a robust overall correlation (0.6420) between predicted machine scores and human scores. Under ideal scoring conditions (100 points), approximately 51.72% of predicted machine scores exhibited deviations within a range of 10 points, while around 81.87% displayed deviations within a range of 20 points; only a minimal percentage of 0.12% demonstrated deviations exceeding the threshold of 50 points. Finally, to further optimize performance, additional features related to bodily movements are extracted by introducing facial expression recognition and gesture recognition algorithms. The fusion of multiple models resulted in an overall average correlation improvement of 0.0226.
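The correlation figures reported above are standard Pearson correlations between machine scores and human scores; assuming that is the statistic used, the computation is:

```python
import math

# Pearson correlation coefficient between two equal-length score lists.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of 0.6420, as reported, indicates a moderately strong positive linear relationship between predicted and expert-assigned scores.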
Funding: the National Natural Science Foundation of China (Grant Number 62076246).
Abstract: Human pose estimation aims to localize the body joints from image or video data. With the development of deep learning, pose estimation has become a hot research topic in the field of computer vision. In recent years, human pose estimation has achieved great success in multiple fields such as animation and sports. However, to obtain accurate positioning results, existing methods may suffer from large model sizes, a high number of parameters, and increased complexity, leading to high computing costs. In this paper, we propose a new lightweight feature encoder to construct a high-resolution network that reduces the number of parameters and lowers the computing cost. We also introduce a semantic enhancement module that improves global feature extraction and network performance by combining channel and spatial dimensions. Furthermore, we propose a densely connected spatial pyramid pooling module to compensate for the decrease in image resolution and information loss in the network. As a result, our method effectively reduces the number of parameters and complexity while ensuring high performance. Extensive experiments show that our method achieves competitive performance while dramatically reducing the number of parameters and operational complexity. Specifically, our method obtains an 89.9% AP score on MPII VAL, while the number of parameters and the complexity of operations are reduced by 41% and 36%, respectively.
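The scale of savings a lightweight encoder can deliver is easy to illustrate with parameter arithmetic. As a sketch only (the abstract does not specify the encoder design, so the depthwise-separable convolution below is a stand-in for a typical lightweighting technique, not the paper's method):

```python
def conv_params(c_in, c_out, k):
    # standard convolution: one k x k filter per (input channel, output channel) pair
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # one k x k depthwise filter per input channel, then a 1 x 1 pointwise mix
    return c_in * k * k + c_in * c_out

std = conv_params(256, 256, 3)                  # 589,824 weights
sep = depthwise_separable_params(256, 256, 3)   # 67,840 weights
print(f"parameter reduction: {1 - sep / std:.1%}")
```

Swapping one such layer already cuts parameters by roughly an order of magnitude, which is how reductions like the 41% reported above become achievable network-wide.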
Funding: This work was supported by grants from the Natural Science Foundation of Hebei Province under Grant No. F2021202021, the S&T Program of Hebei under Grant No. 22375001D, and the National Key R&D Program of China under Grant No. 2019YFB1312500.
Abstract: Human pose estimation is a basic and critical task in the field of computer vision that involves determining the position (or spatial coordinates) of the joints of the human body in a given image or video. It is widely used in motion analysis, medical evaluation, and behavior monitoring. In this paper, the authors propose a method for multi-view human pose estimation. Two image sensors were placed orthogonally with respect to each other to capture the pose of the subject as they moved; this yielded accurate and comprehensive three-dimensional (3D) motion reconstructions that captured their multi-directional poses. Following this, we propose a method based on 3D pose estimation to assess the similarity of the motion features of patients with motor dysfunction by comparing differences between their range of motion and that of normal subjects. We converted these differences into Fugl-Meyer assessment (FMA) scores in order to quantify them. Finally, we implemented the proposed method in the Unity framework and built a virtual reality platform that provides users with human-computer interaction to make the task more enjoyable and to ensure their active participation in the assessment process. The goal is to provide a suitable means of assessing movement disorders without requiring the immediate supervision of a physician.
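With two orthogonally placed cameras, the intuition for recovering 3D joint positions is that the front view supplies (x, y) and the side view supplies (z, y). A minimal sketch under idealized assumptions (calibrated, axis-aligned cameras sharing an image scale; the paper's actual reconstruction pipeline is not detailed in the abstract):

```python
def fuse_orthogonal_views(front_xy, side_zy):
    # front camera observes the x-y plane, side camera the z-y plane;
    # the shared y coordinate is averaged to damp per-view noise
    x, y_front = front_xy
    z, y_side = side_zy
    return (x, (y_front + y_side) / 2.0, z)

# one joint seen by both cameras (units arbitrary)
joint = fuse_orthogonal_views((0.4, 1.2), (0.9, 1.0))
```

Real systems replace this toy fusion with calibrated triangulation, but the orthogonal layout is what makes each coordinate directly observable in at least one view.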
Funding: supported by the Program of Entrepreneurship and Innovation Ph.D. in Jiangsu Province (JSSCBS20211175), the School Ph.D. Talent Funding (Z301B2055), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (21KJB520002).
Abstract: 3D human pose estimation is a major focus area in the field of computer vision and plays an important role in practical applications. This article summarizes the framework of and research progress in estimation from monocular RGB images and videos. An overall perspective of methods integrated with deep learning is introduced, with image-based and video-based inputs proposed as the analysis framework, and common problems are discussed from this viewpoint. The diversity of human postures usually leads to problems such as occlusion and ambiguity, and the lack of training datasets often results in poor generalization ability of the model; regression methods are crucial for solving such problems. For image-based input, the multi-view method is commonly used to solve occlusion problems and is analyzed comprehensively here. For video-based input, human prior knowledge of restricted motion is used to predict human postures, and structural constraints are widely used as prior knowledge. Furthermore, weakly supervised learning methods are studied and discussed for both types of inputs to improve model generalization ability. The problem of insufficient training datasets must also be considered, especially because 3D datasets are usually biased and limited. Finally, emerging and popular datasets and evaluation indicators are discussed; the characteristics of the datasets and the relationships among the indicators are explained and highlighted. This article can thus be useful and instructive for researchers who lack experience and find this field confusing. By providing an overview of 3D human pose estimation, it sorts and refines recent studies, describes kernel problems and common useful methods, and discusses the scope for further research.
Abstract: Human pose estimation (HPE) is a procedure for determining the structure of the body pose, and it is considered a challenging issue in the computer vision (CV) community. HPE finds applications in several fields, namely activity recognition and human-computer interfaces. Despite its benefits, HPE remains challenging due to variations in visual appearance, lighting, occlusions, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. First, the video frame conversion process is performed and pre-processing takes place via a bilateral filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. Besides, the hyperparameters of the EfficientNet model are tuned using the squirrel search algorithm (SSA). In the final stage, the multiclass support vector machine (M-SVM) technique is utilized for the identification and classification of human poses. The design of bilateral filtering followed by the SSA-based EfficientNet model for HPE depicts the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations were executed. The experimental results report the betterment of the SSDCNN-HPE system over recent existing techniques in terms of different measures.
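The squirrel search algorithm used for hyperparameter tuning moves candidate solutions toward the best-found position by a random "gliding" step. A highly simplified single-update sketch (the constants and the one-group update below are illustrative, not the paper's exact formulation, which also models seasonal changes and predator presence):

```python
import random

def ssa_glide_step(positions, best, gliding_constant=1.9, scaling=18.0):
    # each "squirrel" (candidate hyperparameter vector) glides toward
    # the best food source found so far by a random fraction of d_g
    d_g = gliding_constant / scaling  # gliding distance
    return [
        [x + d_g * random.random() * (b - x) for x, b in zip(pos, best)]
        for pos in positions
    ]

random.seed(0)
new_positions = ssa_glide_step([[0.0], [0.5]], best=[1.0])
```

Each iteration nudges the population toward promising hyperparameter settings while the random factor preserves exploration.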
Funding: supported by the Priority Research Centers Program through the NRF funded by MEST (2018R1A6A1A03024003) and the Grand Information Technology Research Center support program (IITP-2020-2020-0-01612) supervised by the IITP under MSIT, Korea.
Abstract: In the new era of technology, daily human activities are becoming more challenging to monitor in complex scenes and backgrounds. To understand scenes and activities from human life logs, human-object interaction (HOI) is important for visual relationship detection and human pose estimation. Activity understanding and interaction recognition between humans and objects, along with pose estimation and interaction modeling, are addressed. Some existing algorithms and feature extraction procedures are complicated, with inaccurate detection of rare human postures and occluded regions, and unsatisfactory detection of objects, especially small-sized objects. Existing HOI detection techniques are instance-centric (object-based), where interaction is predicted between all pairs; such estimation depends on appearance features and spatial information. Therefore, we propose a novel approach to demonstrate that appearance features alone are not sufficient to predict HOI. Furthermore, we detect the human body parts by using a Gaussian Mixture Model (GMM), followed by object detection using YOLO. We predict the interaction points, which directly classify the interaction, and pair them with densely predicted HOI vectors by using the interaction algorithm. The interactions are linked with the human and the object to predict the actions. Experiments performed on two benchmark HOI datasets demonstrate the proposed approach.
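The interaction-point idea can be sketched simply: a point between a human box and an object box is predicted, and detections are paired back to it. A toy illustration (the box format and midpoint rule here are assumptions for illustration, not the paper's exact algorithm):

```python
def center(box):
    # box given as (x1, y1, x2, y2)
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def interaction_point(human_box, object_box):
    # model the interaction point as the midpoint of the two box centers;
    # at inference, a predicted interaction point is matched to the
    # human-object pair whose midpoint lies closest to it
    (hx, hy), (ox, oy) = center(human_box), center(object_box)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)

pt = interaction_point((0, 0, 2, 4), (4, 0, 6, 4))  # person left, object right
```

Because the point is predicted directly from image features, the interaction class does not depend on appearance features of either instance alone.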
Funding: National Natural Science Foundation of China (61806176) and the Fundamental Research Funds for the Central Universities (2019QNA5022).
Abstract: Recovering human pose from RGB images and videos has drawn increasing attention in recent years owing to minimal sensor requirements and applicability in diverse fields such as human-computer interaction, robotics, video analytics, and augmented reality. Although a large amount of work has been devoted to this field, 3D human pose estimation based on monocular images or videos remains a very challenging task due to a variety of difficulties such as depth ambiguity, occlusion, background clutter, and lack of training data. In this survey, we summarize recent advances in monocular 3D human pose estimation. We provide a general taxonomy to cover existing approaches and analyze their capabilities and limitations. We also present a summary of extensively used datasets and metrics, and provide a quantitative comparison of some representative methods. Finally, we conclude with a discussion on realistic challenges and open problems for future research directions.
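Among the metrics such surveys compare, the most common for monocular 3D HPE is MPJPE (mean per-joint position error): the average Euclidean distance between predicted and ground-truth joints, usually after aligning the root joint. A minimal implementation:

```python
import math

def mpjpe(pred, gt):
    # mean per-joint position error over corresponding 3D joints,
    # assuming both poses are already root-aligned
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# two-joint toy pose: second predicted joint is off by 1 unit along x
err = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (0, 0, 0)])
```

Variants such as PA-MPJPE additionally apply a Procrustes alignment before measuring the distances.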
Funding: supported by Universiti Sains Malaysia under FRGS Grant Number FRGS/1/2020/STG07/USM/02/12 (203.PKOMP.6711930) and FRGS Grant Number 304PTEKIND.6316497.USM.
Abstract: In this article, a comprehensive survey of deep learning-based (DL-based) human pose estimation (HPE) is presented to help researchers in the domain of computer vision. HPE is among the fastest-growing research domains of computer vision and is used in solving several problems for human endeavours. After a detailed introduction, three different human body models are presented, followed by the main stages of HPE and the two pipelines of two-dimensional (2D) HPE. The details of the four components of HPE are also presented. The keypoint output formats of two popular 2D HPE datasets and the most cited DL-based HPE articles since the year of the breakthrough are both shown in tabular form. This study highlights the limitations of published reviews and surveys with respect to presenting a systematic review of current DL-based solutions to the 2D HPE model, and provides a detailed and meaningful survey to guide new and existing researchers on DL-based 2D HPE models. Finally, some future research directions in the field of HPE, such as the limited data on disabled persons and multi-training of DL-based models, are revealed to encourage researchers and promote the growth of HPE research.
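As an example of the keypoint output formats such surveys tabulate, the COCO dataset stores 17 keypoints per person as a flat list of (x, y, visibility) triples. A small parser:

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def parse_coco_keypoints(flat):
    # COCO annotations store keypoints as [x1, y1, v1, x2, y2, v2, ...],
    # where v = 0 (not labeled), 1 (labeled but occluded), 2 (labeled, visible)
    assert len(flat) == 3 * len(COCO_KEYPOINTS)
    return {
        name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i, name in enumerate(COCO_KEYPOINTS)
    }

kp = parse_coco_keypoints(list(range(51)))  # dummy annotation for illustration
```

MPII, by contrast, uses a 16-keypoint skeleton with a different joint ordering, which is why format tables are useful to new researchers.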
Funding: National Natural Science Foundation of China, No. 61972458; Natural Science Foundation of Zhejiang Province, No. LZ23F020002.
Abstract: Deep neural networks are vulnerable to attacks from adversarial inputs, yet the corresponding attack research on human pose estimation (HPE), particularly for body joint detection, has been largely unexplored. Transferring classification-based attack methods to body joint regression tasks is not straightforward. Another issue is that attack effectiveness and imperceptibility contradict each other. To solve these issues, we propose local imperceptible attacks on HPE networks. In particular, we reformulate imperceptible attacks on body joint regression as a constrained maximum allowable attack. Furthermore, we approximate the solution using iterative gradient-based strength refinement and greedy pixel selection. Our method crafts effective perceptual adversarial attacks that consider both human perception and attack effectiveness. We conducted a series of imperceptible attacks against state-of-the-art HPE methods, including HigherHRNet, DEKR, and ViTPose. The experimental results demonstrate that the proposed method achieves excellent imperceptibility while maintaining attack effectiveness by significantly reducing the number of perturbed pixels: perturbing approximately 4% of the pixels is sufficient to attack HPE.
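The greedy pixel selection can be sketched as: rank pixels by gradient magnitude and perturb only the top few, keeping the attack local. A toy version on a flattened image (the paper's iterative strength refinement is omitted, and the fixed step size here is an assumption):

```python
def greedy_local_attack(image, grad, budget, eps):
    # perturb only the `budget` pixels whose loss gradient is largest
    # in magnitude, stepping along the gradient sign
    ranked = sorted(range(len(image)), key=lambda i: abs(grad[i]), reverse=True)
    attacked = list(image)
    for i in ranked[:budget]:
        step = eps if grad[i] > 0 else -eps
        attacked[i] = min(1.0, max(0.0, attacked[i] + step))  # keep pixels in [0, 1]
    return attacked

# 4-pixel toy image; only the two highest-|gradient| pixels get perturbed
adv = greedy_local_attack([0.5, 0.5, 0.5, 0.5], [0.9, -0.1, 0.0, -2.0],
                          budget=2, eps=0.1)
```

Restricting the budget is what trades attack strength for imperceptibility, matching the ~4%-of-pixels result above.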
Funding: supported by the Natural Science Foundation of Hubei Province of China under grant number 2022CFB536, the National Natural Science Foundation of China under grant number 62367006, and the 15th Graduate Education Innovation Fund of Wuhan Institute of Technology under grant number CX2023579.
Abstract: Human pose estimation is a critical research area in the field of computer vision, playing a significant role in applications such as human-computer interaction, behavior analysis, and action recognition. In this paper, we propose a U-shaped keypoint detection network (DAUNet) based on an improved ResNet subsampling structure and a spatial grouping mechanism. This network addresses key challenges in traditional methods, such as information loss, large network redundancy, and insufficient sensitivity to low-resolution features. DAUNet is composed of three main components. First, we introduce an improved BottleNeck block that employs partial convolution and strip pooling to reduce computational load and mitigate feature loss. Second, after upsampling, the network eliminates redundant features, improving overall efficiency. Finally, a lightweight spatial grouping attention mechanism is applied to enhance low-resolution semantic features within the feature map, allowing for better restoration of the original image size and higher accuracy. Experimental results demonstrate that DAUNet achieves superior accuracy compared to most existing keypoint detection models, with a mean PCKh@0.5 score of 91.6% on the MPII dataset and an AP of 76.1% on the COCO dataset. Moreover, real-world experiments further validate the robustness and generalizability of DAUNet for detecting human bodies in unknown environments, highlighting its potential for broader applications.
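The PCKh@0.5 score reported above is the standard MPII metric: a predicted joint counts as correct if its distance to the ground truth is at most half the head segment length. A minimal implementation:

```python
import math

def pckh(pred, gt, head_sizes, alpha=0.5):
    # fraction of joints whose error is within alpha * head segment length;
    # head_sizes gives the per-person head length for normalization
    correct = sum(
        math.dist(p, g) <= alpha * h
        for p, g, h in zip(pred, gt, head_sizes)
    )
    return correct / len(pred)

# two 2D joints: the first is exact, the second misses by 10 px with an 8 px head
score = pckh([(0, 0), (10, 0)], [(0, 0), (0, 0)], head_sizes=[8, 8])
```

Normalizing by head size makes the threshold scale-invariant, so near and far persons are scored consistently.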
Funding: supported in part by the National Natural Science Foundation of China under Grants 61973065, U20A20197, and 61973063.
Abstract: Previous multi-view 3D human pose estimation methods neither correlate different human joints in each view nor explicitly model learnable correlations between the same joints in different views, meaning that skeleton structure information is not utilized and multi-view pose information is not completely fused. Moreover, existing graph convolutional operations do not consider the specificity of different joints and different views of pose information when processing skeleton graphs, so the correlation weights between nodes in the graph and their neighborhood nodes are shared. Existing Graph Convolutional Networks (GCNs) therefore cannot efficiently extract global and deep-level skeleton structure information and view correlations. To solve these problems, pre-estimated multi-view 2D poses are organized as a multi-view skeleton graph to explicitly fuse skeleton priors and view correlations and to handle occlusion, with skeleton edges and symmetry edges representing the structural correlations between adjacent joints in each view of the skeleton graph, and view edges representing the correlations between the same joints in different views. To let the graph convolution operation mine elaborate and sufficient skeleton structure information and view correlations, different correlation weights are assigned to different categories of neighborhood nodes and further to each node in the graph. Based on this graph convolution operation, a Residual Graph Convolution (RGC) module is designed as the basic module and combined with a simplified Hourglass architecture to construct Hourglass-GCN, our 3D pose estimation network. Hourglass-GCN, with its symmetrical and concise architecture, processes three scales of multi-view skeleton graphs to extract local-to-global scale and shallow-to-deep level skeleton features efficiently. Experimental results on the common large 3D pose datasets Human3.6M and MPI-INF-3DHP show that Hourglass-GCN outperforms some excellent methods in 3D pose estimation accuracy.
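The key idea of separate correlation weights per edge category can be sketched on scalar node features. This is illustrative only (the actual RGC module additionally learns per-node modulations and operates on feature vectors):

```python
def category_graph_conv(feat, edges, w_self, w_cat):
    # feat: one scalar feature per joint node
    # edges: {"skeleton" | "symmetry" | "view": [(i, j), ...]}, undirected
    # each edge category carries its own mixing weight instead of one shared weight
    out = [w_self * f for f in feat]
    for cat, pairs in edges.items():
        for i, j in pairs:
            out[i] += w_cat[cat] * feat[j]
            out[j] += w_cat[cat] * feat[i]
    return out

out = category_graph_conv(
    [1.0, 2.0, 2.0],                              # three joint nodes
    {"skeleton": [(0, 1)], "view": [(1, 2)]},     # two edge categories
    w_self=1.0,
    w_cat={"skeleton": 0.5, "view": 0.1},
)
```

Giving skeleton, symmetry, and view edges distinct weights is what lets the network weigh within-view structure differently from cross-view agreement.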
Funding: supported by the National Key Research and Development Program of China (Nos. 2021YFC2009200 and 2023YFC3606100) and the Special Project of Technological Innovation and Application Development of Chongqing, China (No. cstc2019jscx-msxmX0167).
Abstract: Due to factors such as motion blur, video defocus, and occlusion, multi-frame human pose estimation is a challenging task. Exploiting temporal consistency between consecutive frames is an efficient approach for addressing this issue. Currently, most methods explore temporal consistency through refinements of the final heatmaps. The heatmaps contain the semantic information of keypoints and can improve detection quality to a certain extent, but they are generated from features, and feature-level refinements are rarely considered. In this paper, we propose a human pose estimation framework with refinements at both the feature and semantics levels. We align auxiliary features with the features of the current frame to reduce the loss caused by different feature distributions, and an attention mechanism is then used to fuse the auxiliary features with the current features. At the semantics level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
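The semantics-level refinement can be sketched as adding a weighted version of the differences between the current heatmap and its temporal neighbors. A one-dimensional toy (the symmetric form and the fixed weight alpha are assumptions; the paper fuses these difference signals with learned layers rather than a hand-set average):

```python
def refine_heatmap(prev_hm, cur_hm, next_hm, alpha=0.5):
    # use adjacent-frame heatmap differences as an auxiliary refinement signal:
    # pull each cell toward the average of its temporal neighbors
    return [
        c + alpha * ((p - c) + (n - c)) / 2.0
        for p, c, n in zip(prev_hm, cur_hm, next_hm)
    ]

# two heatmap cells across three frames; the current frame sits between neighbors
refined = refine_heatmap([0.0, 0.2], [0.4, 0.4], [0.8, 0.6])
```

When the current frame is blurred or occluded, such difference terms let sharper neighboring frames correct the peak.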
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672375 and 61170118).
Abstract: Recently, the stacked hourglass network has shown outstanding performance in human pose estimation. However, repeated bottom-up and top-down strided convolution operations in deep convolutional neural networks lead to a significant decrease in the initial image resolution. To address this problem, we propose to incorporate an affinage module and a residual attention module into the stacked hourglass network for human pose estimation. This paper introduces a novel architecture to replace the up-sampling operation of the stacked hourglass network in order to obtain high-resolution features. We refer to this architecture as the affinage module, which is critical to improving the performance of the stacked hourglass network. Additionally, we propose a novel residual attention module to increase the supervision of the up-sampling process. The effectiveness of the introduced modules is evaluated on standard benchmarks. Various experimental results demonstrate that our method achieves more accurate and more robust human pose estimation results on images with complex backgrounds.
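For context, the up-sampling step that the affinage module replaces is typically a simple nearest-neighbor (or bilinear) resize, which repeats values and adds no new detail. A nearest-neighbor upsample of a 2D feature map, shown here as the baseline being improved upon:

```python
def nearest_upsample(fm, scale):
    # repeat each cell of a 2D feature map `scale` times in both directions
    return [
        [fm[r // scale][c // scale] for c in range(len(fm[0]) * scale)]
        for r in range(len(fm) * scale)
    ]

up = nearest_upsample([[1, 2], [3, 4]], scale=2)  # 2x2 -> 4x4
```

Because this operation is fixed and non-learnable, replacing it with a learned module gives the network a chance to recover high-resolution detail lost to strided convolutions.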
Funding: supported by the National Natural Science Foundation of China (Nos. 62201542 and 62172381), the National Key R&D Programmes of China (Nos. 2022YFC2503405 and 2022YFC0869800), the Fellowship of the China Postdoctoral Science Foundation (No. 2022M723069), and the Fundamental Research Funds for the Central Universities, China.
Abstract: This paper introduces a novel framework, RFPose-OT, to enable three-dimensional (3D) human pose estimation from radio frequency (RF) signals. Different from existing methods that predict human poses from RF signals directly at the signal level, we consider the structural difference between RF signals and human poses, propose a transformation of the RF signals to the pose domain at the feature level based on optimal transport (OT) theory, and generate human poses from the transformed features. To evaluate RFPose-OT, we built a radio system and a multi-view camera system to acquire the RF signal data and the ground-truth human poses. Experimental results in a basic indoor environment, an occluded indoor environment, and an outdoor environment demonstrate that RFPose-OT can predict 3D human poses with higher precision than state-of-the-art methods.
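Optimal transport measures the minimal cost of morphing one distribution into another, which is what lets RF features be mapped toward the pose-feature domain. In one dimension OT has a closed form: the optimal plan pairs sorted samples. A tiny sketch of that intuition (RFPose-OT itself computes OT between high-dimensional feature distributions, not 1D samples):

```python
def ot_cost_1d(a, b):
    # 1D optimal transport between two equal-size sample sets with unit masses:
    # the optimal coupling matches the i-th smallest of a to the i-th smallest of b
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

cost = ot_cost_1d([0.0, 1.0, 3.0], [1.0, 0.0, 4.0])
```

A small transport cost indicates the two feature distributions are already close, which is the training signal driving the domain transformation.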
Funding: the Shanghai Municipal Education Commission Project (No. SDL10026).
Abstract: In current interactive television schemes, viewpoints must be manipulated by the user. However, there is no efficient method to assist a user in automatically identifying and tracking the optimum viewpoint when observing an object of interest, because many objects, most often humans, move rapidly and frequently. This paper proposes a novel framework for determining and tracking the virtual camera that best captures the front of the person of interest (PoI). First, a PoI is interactively chosen in a segmented 3D scene reconstructed by the space carving method. Second, key points of the PoI's torso are detected using a model-based method, and the person's global motion, including rotation and translation, is estimated using a closed-form method with 3 corresponding points. Finally, the front direction of the PoI is tracked temporally using the unscented particle filter (UPF). Experimental results show that the method can properly compute the front direction of the PoI and robustly track the best viewpoints.
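The front direction of a torso can be derived from three key points as the normal of the plane they span, via a cross product. A minimal sketch (the choice of shoulder and pelvis points, and the winding that makes the normal point out of the chest, are assumptions for illustration):

```python
def cross(u, v):
    # 3D cross product of two vectors
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def torso_front_direction(l_shoulder, r_shoulder, pelvis):
    # the torso plane is spanned by the shoulder line and the spine;
    # its normal (up to sign, fixed by the winding order) faces the front
    shoulder_axis = tuple(r - l for r, l in zip(r_shoulder, l_shoulder))
    spine_axis = tuple(p - l for p, l in zip(pelvis, l_shoulder))
    return cross(shoulder_axis, spine_axis)

# upright person facing the -z direction in this toy coordinate frame
front = torso_front_direction((-1, 1, 0), (1, 1, 0), (0, 0, 0))
```

Tracking this direction over time is what the UPF smooths against the rapid, frequent motion noted above.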
基金Supported by the National Key Research and Development Programme of China(2018YFC0831201).
Abstract: Background: In computer vision, simultaneously estimating human pose, shape, and clothing is a practical issue in real life, but it remains a challenging task owing to the variety of clothing, the complexity of deformation, the shortage of large-scale datasets, and the difficulty of estimating clothing style. Methods: We propose a multistage weakly supervised method that makes full use of data with less labeled information to learn to estimate human body shape, pose, and clothing deformation. In the first stage, the SMPL human-body model parameters are regressed from multi-view 2D key points of the human body. Using multi-view information as weak supervision avoids the depth ambiguity of a single view, yields a more accurate human posture, and makes supervisory information easy to access. In the second stage, clothing is represented by a PCA-based model whose parameters are regressed using 2D key points of the clothing as supervision. In the third stage, we predefine an embedding graph for each type of clothing to describe its deformation, and the clothing mask is used to further adjust the deformation. To facilitate training, we constructed a multi-view synthetic dataset that includes BCNet and SURREAL. Results: Experiments show that the accuracy of our method reaches the same level as that of SOTA methods using strong supervision, while using only weakly supervised information. Because this study uses only weak supervision, which is much easier to obtain, it has the advantage of utilizing existing data as training data. Experiments on the DeepFashion2 dataset show that our method can make full use of existing weak supervision for fine-tuning on a dataset with little supervision information, in contrast to strongly supervised methods that cannot be trained or adjusted owing to the lack of exact annotations. Conclusions: Our weakly supervised method can accurately estimate human body size, pose, and several common types of clothing, and overcomes the current shortage of clothing data.
Funding: supported in part by the National Natural Science Foundation of China (61672466, 62011530130) and the Joint Fund of Zhejiang Provincial Natural Science Foundation (LSZ19F010001).
Abstract: Scale variation is a major challenge in multi-person pose estimation. In scenes where persons are present at various distances, models tend to perform better on larger-scale persons, while performance for smaller-scale persons often falls short of expectations. Effectively balancing persons of different scales therefore poses a significant challenge, so this paper proposes a new multi-person pose estimation model called FSANet to improve performance in complex scenes. Our model uses the High-Resolution Network (HRNet) as the backbone and feeds the outputs of the last stage's four branches into the dilated convolution-based (DCB) module, which employs a parallel structure incorporating dilated convolutions with different rates to expand the receptive field of each branch. Subsequently, the attention operation-based (AOB) module performs attention operations at both the branch and channel levels to enhance high-frequency features and reduce the influence of noise. Finally, predictions are made using the heatmap representation. The model can recognize images with diverse scales and more complex semantic information. Experimental results demonstrate that FSANet achieves competitive results on the MSCOCO and MPII datasets, validating the effectiveness of our proposed approach.
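Why dilated convolutions expand the receptive field: a kernel of size k with dilation rate d covers (k - 1) * d + 1 input positions, so parallel branches with different rates see the scene at different scales without extra parameters. The standard formula:

```python
def dilated_receptive_field(kernel_size, dilation):
    # a dilated kernel inserts (dilation - 1) gaps between taps,
    # so its span grows linearly with the dilation rate
    return (kernel_size - 1) * dilation + 1

# a 3x3 kernel at dilation rates 1, 2, 4, 8
fields = [dilated_receptive_field(3, d) for d in (1, 2, 4, 8)]
```

This is why stacking a few parallel dilated branches can match the receptive field of much larger dense kernels, helping small- and large-scale persons alike.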
Funding: supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218176) and the Soonchunhyang University Research Fund.
Abstract: Human Interaction Recognition (HIR) is a challenging issue in computer vision research due to the involvement of multiple individuals and their mutual interactions within video frames generated from their movements. HIR requires more sophisticated analysis than Human Action Recognition (HAR), since HAR focuses solely on individual activities such as walking or running, while HIR involves the interactions between people. This research aims to develop a robust system for recognizing five common human interactions (hugging, kicking, pushing, pointing, and no interaction) from video sequences captured by multiple cameras. In this study, a hybrid Deep Learning (DL) and Machine Learning (ML) model was employed to improve classification accuracy and generalizability. The dataset was collected in an indoor environment, with four-channel cameras capturing the five types of interactions among 13 participants. The data were processed using a DL model with a fine-tuned ResNet (Residual Networks) architecture based on 2D Convolutional Neural Network (CNN) layers for feature extraction. Subsequently, machine learning models were trained for interaction classification using six commonly used ML algorithms: SVM, KNN, RF, DT, NB, and XGBoost. The results demonstrate a high accuracy of 95.45% in classifying human interactions. The hybrid approach enabled effective learning, resulting in highly accurate performance across different interaction types. Future work will explore more complex scenarios involving multiple individuals based on this architecture.
Funding: the authors thank the Deanship of Scientific Research at Majmaah University for funding this work under Project No. R-2023-667.
Abstract: Inpatient falls from beds in hospitals are a common problem, and such falls may result in severe injuries. This problem can be addressed by continuous monitoring of patients using cameras, and recent advancements in deep learning-based video analytics have made fall detection more effective and efficient. Along with fall detection, monitoring the other activities of patients is also of significant concern for assessing improvements in their health. Highly computation-intensive models are required to monitor every action of the patient precisely, and this requirement limits the applicability of such networks. Hence, to keep the model lightweight, existing fall detection networks can be extended to monitor the general activities of the patients along with fall detection. Motivated by this notion, we propose a novel, lightweight, and efficient patient activity monitoring system that broadly classifies patients' activities into fall, activity, and rest classes based on their poses. The whole network comprises three sub-networks: a Convolutional Neural Network (CNN) based video compression network, a Lightweight Pose Network (LPN), and a Residual Network (ResNet) Mixer block-based activity recognition network. The compression network compresses the video streams using deep learning for efficient storage and retrieval; LPN then estimates human poses, and the activity recognition network classifies the patients' activities based on these poses. The proposed system shows an overall accuracy of approximately 99.7% on a standard dataset, with 99.63% fall detection accuracy, and efficiently monitors different events, which may help to monitor falls and improve inpatients' health.
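A pose-based fall/activity/rest split can be sketched with a simple heuristic: a lying posture yields a wide, flat keypoint bounding box, and a fall differs from rest mainly in the sudden motion preceding it. This rule-based toy stands in for the paper's learned ResNet Mixer classifier; the thresholds and features are invented for illustration:

```python
def coarse_activity(keypoints, motion, motion_thresh=0.3):
    # keypoints: list of (x, y) joint positions for one frame;
    # a lying posture makes the keypoint box wider than it is tall
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    lying = (max(xs) - min(xs)) > (max(ys) - min(ys))
    if lying:
        # lying with a burst of motion suggests a fall; lying still is rest
        return "fall" if motion > motion_thresh else "rest"
    return "activity" if motion > motion_thresh else "rest"

label = coarse_activity([(0, 0), (4, 1)], motion=0.9)  # wide box + sudden motion
```

A learned classifier replaces these brittle thresholds, but the same pose geometry and temporal cues are the underlying signal.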
Funding: This study was supported by the National Natural Science Foundation of China (Grant Nos. 91948201 and 62073191).
Abstract: Human action recognition based on skeleton information has been extensively used in various areas, such as human-computer interaction. In this paper, we extracted human skeleton data by constructing a two-stage human pose estimation model that combined an improved Single Shot Detector (SSD) algorithm with Convolutional Pose Machines (CPM) to obtain human skeleton heatmaps. The backbone of the SSD algorithm was replaced with ResNet, which can characterize images effectively. In addition, we designed multiscale transformation rules for CPM to fuse information at different scales, and a convolutional neural network to classify the skeleton keypoint heatmaps and complete action recognition. Indoor and outdoor experiments were conducted on the Caster Moma mobile robot platform: without an external remote control, the real-time movement of the robot was controlled by the leader through command actions.
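CPM-style models output one heatmap per keypoint, and the keypoint location is read off as the heatmap peak. A minimal decoder for one 2D heatmap:

```python
def decode_heatmap(heatmap):
    # heatmap: 2D list of confidences for one keypoint;
    # the keypoint location is the (row, col) of the peak cell
    best_r, best_c, best_v = 0, 0, heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best_v:
                best_r, best_c, best_v = r, c, v
    return best_r, best_c

loc = decode_heatmap([
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.3],
    [0.1, 0.3, 0.2],
])
```

Production decoders usually add sub-pixel refinement around the peak, but the argmax is the core of turning heatmaps into skeleton coordinates for the downstream action classifier.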
Funding: Supported by the Open Fund of the Key Laboratory of Anhui Higher Education Institutes (CS2021-07), the National Natural Science Foundation of China (61701004), and the Outstanding Young Talents Support Program of Anhui Province (gxyq2021178).
Abstract: Computer vision, a scientific discipline that enables machines to perceive visual information, aims to supplant human eyes in tasks encompassing object recognition, localization, and tracking. In traditional educational settings, instructors or evaluators assess teaching performance based on subjective judgment. However, with continuous advancements in computer vision technology, it becomes increasingly important for computers to take on the role of judges, obtaining vital information and making unbiased evaluations. Against this backdrop, this paper proposes a deep learning-based approach for evaluating lecture posture. First, feature information is extracted from various dimensions, including head position, hand gestures, and body posture, using a human pose estimation algorithm. Second, a machine learning-based regression model is employed to predict machine scores by comparing the extracted features with expert-assigned human scores. The correlation between machine scores and human scores is investigated through experiments and analysis, revealing a robust overall correlation (0.6420) between predicted machine scores and human scores. Under ideal scoring conditions (100 points), approximately 51.72% of predicted machine scores exhibited deviations within 10 points, while around 81.87% showed deviations within 20 points; only a minimal percentage (0.12%) showed deviations exceeding 50 points. Finally, to further optimize performance, additional features related to bodily movements are extracted by introducing facial expression recognition and gesture recognition algorithms. The fusion of multiple models resulted in an overall average correlation improvement of 0.0226.
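A correlation between machine and human scores, such as the 0.6420 reported above, is typically a Pearson correlation (the abstract does not name the coefficient, so Pearson is an assumption here). It can be computed directly:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# toy machine scores vs. expert-assigned human scores
r = pearson([60, 70, 80, 90], [62, 68, 83, 88])
```

Values near 1.0 indicate the machine ranks lectures almost exactly as the experts do; the deviation percentiles quoted above complement this with an absolute-error view.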