Funding: Supported in part by the Key Program of NSFC (Grant No. U1908214); the Special Project of Central Government Guiding Local Science and Technology Development (Grant No. 2021JH6/10500140); the Program for the Liaoning Distinguished Professor; the Program for Innovative Research Team in University of Liaoning Province (LT2020015), Dalian (2021RT06), and Dalian University (XLJ202010); the Science and Technology Innovation Fund of Dalian (Grant No. 2020JJ25CY001); and the Dalian University Scientific Research Platform Project (No. 202101YB03).
Abstract: Multi-view multi-person 3D human pose estimation is an active topic in human pose estimation because of its wide range of application scenarios. With the introduction of end-to-end direct regression methods, the field has entered a new stage of development. However, even with the best-performing method, the regression results for joints that are more heavily influenced by external factors are not accurate enough. In this paper, we propose an effective feature recalibration module based on the channel attention mechanism, together with a relatively optimal calibration strategy, and apply them to the multi-view multi-person 3D human pose estimation task to improve detection accuracy for joints that are more severely affected by external factors. Specifically, the recalibration module and strategy achieve a relatively optimal weight adjustment of joint feature information, which enables the model to learn the dependencies between joints and the dependencies between people and their corresponding joints. We call this method the Efficient Recalibration Network (ER-Net). Finally, experiments were conducted on two benchmark datasets for this task, Campus and Shelf, on which the PCP reached 97.3% and 98.3%, respectively.
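The abstract does not spell out how the recalibration module works. As a rough illustration of what channel-attention-based feature recalibration commonly looks like, the sketch below follows the generic squeeze-and-excitation pattern: average each channel, pass the averages through a small bottleneck, and rescale each channel by the resulting gate. It is not the ER-Net module; the function name `channel_recalibrate`, the toy weights `w1` and `w2`, and the feature layout are all assumptions made for this example.

```python
import math

def channel_recalibrate(features, w1, w2):
    """Generic squeeze-and-excitation style recalibration (illustrative only).

    features: list of C channels, each a flat list of activations.
    w1: bottleneck weights, shape (C/r) x C (hypothetical; normally learned).
    w2: expansion weights, shape C x (C/r) (hypothetical; normally learned).
    """
    # Squeeze: one global average per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every activation of a channel by that channel's gate.
    recalibrated = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return gates, recalibrated

# Toy input: 3 channels with 2 activations each.
features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
w1 = [[1.0, 0.0, 0.0]]        # 1 x 3 bottleneck
w2 = [[1.0], [0.0], [-1.0]]   # 3 x 1 expansion
gates, out = channel_recalibrate(features, w1, w2)
```

With these toy weights the first channel is emphasized, the second is gated at exactly one half, and the third is suppressed. In a trained network the weights would be learned, so the gates reflect which channels are informative for a given input.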
Funding: This work was supported by the National Natural Science Foundation of China under grants U1933104 and 62071081; the LiaoNing Revitalization Talents Program under grant XLYC1807019; the Liaoning Province Natural Science Foundation under grant 2019-MS-058; the Dalian Science and Technology Innovation Foundation under grant 2018J12GX044; the Fundamental Research Funds for the Central Universities under grants DUT20LAB113 and DUT20JC07; and the Cooperative Scientific Research Project of the Chunhui Plan of the Ministry of Education.
Abstract: Device-free gesture recognition is an emerging wireless sensing technique that recognizes gestures by analyzing their influence on surrounding wireless signals; it may empower wireless networks with augmented sensing ability. Researchers have made great progress in single-person device-free gesture recognition. However, when multiple persons perform gestures simultaneously, the received signals are mixed together, and traditional methods no longer work well. Moreover, the anonymity of persons and changes in the surrounding environment cause feature shift and mismatch, which degrades recognition accuracy remarkably. To address these problems, we explore and exploit the diversity of spatial information and propose a multidimensional analysis method to separate the gesture features of each person using a focusing sensing strategy. We also present a deep-learning-based robust device-free gesture recognition framework, which leverages an adversarial approach to extract robust gesture features that are insensitive to changes of persons and environment. Furthermore, we develop a 77 GHz mmWave prototype system and evaluate the proposed methods extensively. Experimental results reveal that the proposed system achieves average accuracies of 93% and 84% when 10 gestures are performed in different environments by two and four persons simultaneously, respectively. (Received: Jun. 18, 2020; Revised: Aug. 06, 2020; Editor: Ning Ge.)
Abstract: Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks target either footprint tracking or pose tracking, but not both. Frameworks designed for the former cannot be used for the latter, because they directly obtain 3D positions on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. In contrast, frameworks designed for pose tracking generally isolate multi-view and multi-frame associations and may not be sufficiently robust for footprint tracking, which utilizes fewer key points than pose tracking and therefore has weaker multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework to bridge the gap between footprint tracking and pose tracking. Without additional modifications, the framework can adopt monocular 2D bounding boxes and 2D poses as its input to produce robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are jointly employed to improve association and triangulation. Our framework is shown to provide state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
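To make the homography remark concrete, the following minimal sketch maps an image pixel to ground-plane coordinates with a 3x3 homography. The matrix `H` and the helper name `project_to_ground` are made up for illustration and do not come from the paper; in practice the homography would come from camera calibration.

```python
def project_to_ground(H, pixel):
    """Map an image pixel (u, v) to ground-plane coordinates (x, y).

    H is a 3x3 image-to-ground homography given as nested lists.
    The mapping is only valid for points that lie on the ground plane.
    """
    u, v = pixel
    # Multiply H by the homogeneous pixel (u, v, 1).
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    # Divide by the homogeneous coordinate to get plane coordinates.
    return (x / w, y / w)

# Hypothetical homography, chosen only so the arithmetic is easy to follow.
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, -1.0],
     [0.0, 0.0, 2.0]]
ground_xy = project_to_ground(H, (3.0, 5.0))
```

Because a homography relates two planes, a head or wrist pixel pushed through the same `H` would land at the wrong ground position. This is why the abstract says the approach does not carry over to 3D poses above the ground.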
Funding: Partly supported by the National Natural Science Foundation of China (62122058, 62171317, and 62231018).
Abstract: Existing multi-person reconstruction methods require the human bodies in the input image to occupy a considerable portion of the picture. However, low-resolution human objects are ubiquitous because of the trade-off between field of view and target distance under a limited camera resolution. In this paper, we propose an end-to-end multi-task framework for multi-person inference from a low-resolution image (MILI). To perceive more information from a low-resolution image, we use paired high-resolution and low-resolution images for training, and design a restoration network with a simple loss for better feature extraction from the low-resolution image. To address the occlusion problem in multi-person scenes, we propose an occlusion-aware mask prediction network to estimate the mask of each person during 3D mesh regression. Experimental results on both small-scale and large-scale scenes demonstrate that our method outperforms the state-of-the-art methods both quantitatively and qualitatively. The code is available at http://cic.tju.edu.cn/faculty/likun/projects/MILI.
Funding: Supported by the National Key Research and Development Program of China (Nos. 2021YFC2009200 and 2023YFC3606100) and the Special Project of Technological Innovation and Application Development of Chongqing, China (No. cstc2019jscx-msxmX0167).
Abstract: Multi-frame human pose estimation is a challenging task because of factors such as motion blur, out-of-focus video, and occlusion. Exploiting temporal consistency between consecutive frames is an efficient way to address this issue. Currently, most methods exploit temporal consistency by refining the final heatmaps. The heatmaps contain the semantic information of key points and can improve detection quality to a certain extent. However, they are generated from features, and feature-level refinements are rarely considered. In this paper, we propose a human pose estimation framework with refinements at both the feature and semantic levels. We align auxiliary features with the features of the current frame to reduce the loss caused by different feature distributions. An attention mechanism is then used to fuse the auxiliary features with the current features. At the semantic level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
Funding: The National Natural Science Foundation of China under Grant Nos. 61672077 and 61532002; the Applied Basic Research Program of Qingdao under Grant No. 161013xx; and the National Science Foundation of USA under Grant Nos. US-0949467, IIS-1047715, IIS-1715985, IIS61672149, and IIS-1049448.
Abstract: For multi-person 2D pose estimation, current deep-learning-based methods have exhibited impressive performance, but trade-offs among efficiency, robustness, and accuracy remain unavoidable in existing approaches. In principle, bottom-up methods are more efficient than top-down methods, but they are less accurate. To make full use of their respective advantages, in this paper we design a novel bidirectional optimization coupled lightweight network (BOCLN) architecture for efficient, robust, and general-purpose multi-person 2D (2-dimensional) pose estimation from natural images. Within the BOCLN framework, the bottom-up network focuses on global features, while the top-down network places emphasis on detailed features. The entire framework shares global features along the bottom-up data stream, while the top-down data stream aims to accelerate accurate pose estimation. In particular, to exploit the priors of the relationships between human joints, we propose a probability limb heat map to represent the spatial context of the joints and guide the overall pose skeleton prediction, so that each person's pose estimation in cluttered scenes (involving crowds) can be as accurate and robust as possible. Therefore, benefiting from the novel BOCLN architecture, the time-consuming refinement procedure can be greatly simplified to an efficient lightweight network. Extensive experiments and evaluations on public benchmarks have confirmed that our new method is more efficient and robust, yet still attains competitive accuracy compared with the state-of-the-art methods. Our BOCLN shows even greater promise in online applications.
Funding: The Natural Science Foundation of Shanghai Municipality (Grant No. 18ZR1415100) and the National Natural Science Foundation of China (Grant No. 61703262).
Abstract: This paper presents a multi-person vision tracking approach based on human body localization features to address the problem of interactive object localization and tracking in a home monitoring scenario. First, the human body localization model is used to obtain the 3D position of the human body, which is then used to construct the human body motion model based on the Kalman filter method. At the same time, the human appearance model is constructed by fusing human color features with histogram of oriented gradients features to better characterize the human body. Second, the human body observation model is constructed from the motion model and the appearance model to measure the similarity between the human body state sequence in the historical frames and the human body observation in the current frame, from which the cost matrix is obtained. Third, the Hungarian maximum matching algorithm is employed to match each human body in the current and historical frames, and an exception detection mechanism is constructed at the same time to further reduce the probability of tracking and matching failure. Finally, a multi-person vision tracking verification platform was constructed, and the achieved average accuracy was 96.6% in cases of human body overlapping, occlusion, disappearance, and appearance; this verifies the feasibility and effectiveness of the proposed method.
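The association pipeline described in this abstract (motion prediction, appearance comparison, cost matrix, one-to-one matching) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation. It keeps only the constant-velocity prediction step of a Kalman filter, uses 2D positions for brevity, and stands in a toy histogram-intersection term for the color and HOG appearance model. It also replaces the Hungarian algorithm with a brute-force search, which is only practical for a handful of people. All names and the `appearance_weight` value are hypothetical.

```python
import math
from itertools import permutations

def predict(state, dt=1.0):
    """Constant-velocity prediction (covariance handling of a full Kalman filter omitted)."""
    x, y, vx, vy = state
    return (x + vx * dt, y + vy * dt)

def pair_cost(track, detection, appearance_weight=0.5):
    """Cost of assigning one detection to one track: motion term plus appearance term."""
    px, py = predict(track["state"])
    motion = math.hypot(detection["pos"][0] - px, detection["pos"][1] - py)
    # Histogram intersection in [0, 1]; a higher overlap means a more similar appearance.
    overlap = sum(min(a, b) for a, b in zip(track["hist"], detection["hist"]))
    return motion + appearance_weight * (1.0 - overlap)

def match(tracks, detections):
    """Return the cost matrix and the minimum-cost one-to-one assignment.

    Brute force over permutations; assumes len(detections) >= len(tracks).
    """
    cost = [[pair_cost(t, d) for d in detections] for t in tracks]
    best = min(permutations(range(len(detections)), len(tracks)),
               key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)))
    return cost, list(best)

tracks = [
    {"state": (0.0, 0.0, 1.0, 0.0), "hist": [1.0, 0.0]},
    {"state": (5.0, 5.0, 0.0, -1.0), "hist": [0.0, 1.0]},
]
detections = [
    {"pos": (5.0, 4.1), "hist": [0.1, 0.9]},
    {"pos": (1.2, 0.0), "hist": [0.9, 0.1]},
]
cost, assignment = match(tracks, detections)  # assignment[i] = detection index for track i
```

For realistic numbers of people, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation search. The exception detection mechanism mentioned in the abstract is not modeled here.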
Funding: Partially supported by the National Natural Science Foundation of China (No. 61872317) and Face Unity Technology.
Abstract: We present a multiview method for markerless motion capture of multiple people. The main challenge in this problem is to determine cross-view correspondences for the 2D joints in the presence of noise. We propose a 3D hypothesis clustering technique to solve this problem. The core idea is to transform joint matching in 2D space into a clustering problem in a 3D hypothesis space. In this way, evidence from photometric appearance, multiview geometry, and bone length can be integrated to solve the clustering problem efficiently and robustly. Each cluster encodes a set of matched 2D joints for the same person across different views, from which the 3D joints can be effectively inferred. We then assemble the inferred 3D joints to form full-body skeletons for all persons in a bottom-up way. Our experiments demonstrate the robustness of our approach even in challenging cases with heavy occlusion, closely interacting people, and few cameras. We have evaluated our method on many datasets, and our results show that it has significantly lower estimation errors than many state-of-the-art methods.
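As a small illustration of how a pair of matched 2D joints can give rise to a 3D joint or a 3D hypothesis, the sketch below triangulates two viewing rays with the midpoint method. The abstract does not say which triangulation the authors use, so treat this as one common option and not as their procedure. The camera centers, ray directions, and the function name `triangulate_midpoint` are assumptions for the example.

```python
def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment between two viewing rays.

    c1, c2: camera centers; d1, d2: ray directions through the matched
    2D joints, all as 3-tuples in world coordinates.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel")
    # Ray parameters of the mutually closest points.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))

# Two hypothetical cameras one unit left and right of the origin,
# both looking at a joint located at (0, 0, 2).
joint_3d = triangulate_midpoint((-1.0, 0.0, 0.0), (1.0, 0.0, 2.0),
                                (1.0, 0.0, 0.0), (-1.0, 0.0, 2.0))
```

When the 2D detections are noisy the two rays no longer intersect, and the midpoint gives a reasonable compromise. With more than two views, a least-squares triangulation over all matched joints in a cluster is the usual choice.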