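Abstract
Objective Conventional 3D human pose estimation methods usually take a single-frame point cloud as input, which may ignore the intrinsic prior of human motion smoothness and produce jitter artifacts. At present, real image datasets with 2D human pose annotations are relatively easy to obtain, whereas collecting large-scale real image datasets with high-quality 3D human pose annotations for fully supervised training is difficult. To address this, we propose a new 3D human pose estimation method for point cloud sequences.
Method Pose-aware point clouds are first estimated from the depth image sequence. A neural network is then built on temporal information to encode the spatial-temporal features of the pose-aware point cloud sequence. Weakly supervised deep learning is adopted to exploit the large amount of more easily obtained data with 2D human pose annotations. Finally, a multi-task network jointly trains human pose estimation and human motion prediction to improve optimization.
Result The algorithm is evaluated on two datasets. On the ITOP (invariant-top view dataset) dataset, the mean average precision (mAP) of our method is 0.99, 13.18 and 17.96 percentage points higher than those of the compared methods, respectively. On the NTU-RGBD dataset, the mAP of our method is 7.03 percentage points higher than that of the state-of-the-art WSM (weakly supervised adversarial learning methods). Ablation experiments on the ITOP dataset verify the effectiveness of each component of the algorithm. Compared with single-task training, jointly performing human pose estimation and motion prediction with the multi-task network improves mAP by more than 2 percentage points.
Conclusion The proposed 3D human pose estimation method for point cloud sequences makes full use of the prior of human motion continuity, yields smoother pose estimates, and performs well on both the ITOP and NTU-RGBD datasets. With the multi-task joint optimization strategy, human pose estimation and motion prediction are solved together and benefit each other.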
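The mAP values reported in the abstract are detection rates: a predicted joint counts as correct when it lies within a distance threshold of the ground truth. The following is a minimal sketch of that kind of metric, not the authors' evaluation code. The function name `mean_average_precision`, the input layout (a list of frames, each a list of `(x, y, z)` joints in metres), and the default 10 cm threshold are assumptions made for illustration.

```python
import math


def mean_average_precision(pred_poses, gt_poses, threshold_m=0.10):
    """Per-joint detection rate averaged over joints (illustrative sketch).

    pred_poses, gt_poses: lists of frames; each frame is a list of (x, y, z)
    joint positions in metres (assumed layout). A joint is "detected" in a
    frame when its Euclidean error is at most threshold_m.
    Returns (mAP, per_joint_rates).
    """
    num_frames = len(gt_poses)
    num_joints = len(gt_poses[0])
    per_joint = []
    for j in range(num_joints):
        hits = sum(
            1
            for pred, gt in zip(pred_poses, gt_poses)
            if math.dist(pred[j], gt[j]) <= threshold_m
        )
        per_joint.append(hits / num_frames)
    return sum(per_joint) / num_joints, per_joint


# Toy data: two frames, two joints.
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
      [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.05, 0.0, 0.0), (1.2, 0.0, 0.0)],
        [(0.0, 0.2, 0.0), (1.0, 0.0, 0.05)]]
m_ap, rates = mean_average_precision(pred, gt)
```

In the toy data each joint is within 10 cm in exactly one of the two frames, so each per-joint rate is 0.5 and so is their mean.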
Objective Point cloud-based 3D human pose estimation is a key problem in computer vision, with applications in augmented reality/virtual reality (AR/VR), human-computer interaction (HCI), motion retargeting, and virtual avatar manipulation. Current deep learning-based 3D human pose estimation faces the following challenges: 1) The task suffers from ambiguity caused by occlusion and self-occlusion, and noisy point clouds from depth cameras make it difficult to learn a proper pose estimation model. 2) Current depth image-based methods mainly estimate pose from a single image, which may ignore the intrinsic prior of human motion smoothness and lead to jittery results on consecutive point cloud sequences. Point cloud sequences could be leveraged for high-fidelity pose estimation by enforcing motion smoothness, but designing an effective way to estimate human poses by modeling point cloud sequences is challenging. 3) It is hard to collect a large-scale real image dataset with high-quality 3D human pose annotations for fully supervised training, whereas real datasets with 2D human pose annotations are easy to collect. Moreover, human pose estimation is closely related to motion prediction, which aims to predict future motion. An open question is whether 3D human pose estimation and motion prediction can benefit each other.
Method We develop a method to obtain high-fidelity 3D human poses from point cloud sequences. A weakly supervised deep learning architecture learns 3D human pose from 3D point cloud sequences. We design a dual-level human pose estimation pipeline that takes point cloud sequences as input. 1) 2D pose information is estimated from the depth maps, so that the background is removed and the pose-aware point clouds are extracted. To ensure that the normalized sequential point clouds share the same scale, all point clouds are normalized with a fixed bounding box. 2) The spatial-temporal features of the pose-aware point cloud sequences are encoded by a hierarchical PointNet++ backbone and long short-term memory (LSTM) layers. To improve optimization, a multi-task network jointly solves human pose estimation and motion prediction. To use more training data with 2D human pose annotations and to reduce ambiguity through the supervision of 2D joints, weakly supervised learning is adopted in our framework.
Result To validate the proposed algorithm, experiments are conducted on two public datasets, the invariant-top view dataset (ITOP) and the NTU-RGBD dataset. Our method is compared with several popular methods: V2V-PoseNet, the viewpoint invariant method (VI), the Inference Embedded method, and the weakly supervised adversarial learning method (WSM). On the ITOP dataset, with a threshold of 10 cm, our mean average precision (mAP) is 0.99 percentage points higher than that of WSM, and 13.18 and 17.96 percentage points higher than those of VI and the Inference Embedded method, respectively. Our mean joint error is lower than those of VI, the Inference Embedded method, V2V-PoseNet and WSM by 3.33 cm, 5.17 cm, 1.67 cm and 0.67 cm, respectively. The performance gain can be attributed to the sequential input data and to the constraints on motion parameters such as velocity and acceleration: 1) The sequential data is encoded by the LSTM units, which yields smoother predictions and improves estimation performance. 2) The motion parameters alleviate the jitter caused by random sampling and provide direct supervision of the joint coordinates. On the NTU-RGBD dataset, we compare our method with WSM. With the threshold set to 10 cm, the mAP of our method is 7.03 percentage points higher than that of WSM. Ablation experiments are also carried out on the ITOP dataset to investigate the effect of individual components. To understand the effect of the sequential point cloud input, we run experiments with different temporal receptive fields. A receptive field of 1 corresponds to estimation without sequential data. The percentage of correct keypoints (PCK) drops to its lowest value, 88.57%, when the receptive field is 1; it increases as the receptive field grows from 1 to 5 and becomes stable when the receptive field is greater than 13. The PCK of the model trained only with fully labeled data is 87.55%, and that of the model trained with both fully and weakly labeled data is 90.58%, which shows that our weakly supervised learning improves the PCK by about 3 percentage points. The experiments also demonstrate that our weakly supervised learning method works with only a small amount of fully labeled data. Compared with models trained for a single task, the multi-task network improves the mAP of human pose estimation and motion prediction by more than 2 percentage points.
Conclusion Our method makes full use of the prior of human motion continuity to obtain smoother human pose estimation results. The experiments demonstrate that each of the proposed components is effective and that our method achieves state-of-the-art performance on the ITOP and NTU-RGBD datasets. The joint training strategy is effective for the mutually related tasks of human pose estimation and motion prediction. With the weakly supervised method on sequential data, the model can use more easy-to-access training data and is robust to different levels of training data annotation. It could be applied to scenarios that require high-quality human poses, such as motion retargeting and virtual fitting. Our method shows the potential of using sequential data as input.
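Two steps named in the abstract can be illustrated in a few lines: normalizing every frame with one fixed bounding box so that all point clouds share the same scale, and deriving velocity and acceleration from a joint trajectory by finite differences. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the mapping to the range [-1, 1], the box parameters, and the use of mean acceleration magnitude as a jitter indicator are all hypothetical choices for illustration.

```python
import math


def normalize_to_fixed_box(points, box_center, box_size):
    """Map points into [-1, 1]^3 relative to one fixed axis-aligned box.

    Using the same box_center and box_size (assumed values) for every frame
    keeps all point clouds of a sequence in the same scale.
    """
    half = [s / 2.0 for s in box_size]
    return [
        tuple((p[i] - box_center[i]) / half[i] for i in range(3))
        for p in points
    ]


def finite_differences(joint_track, dt=1.0):
    """Velocity and acceleration of one joint over consecutive frames.

    joint_track: list of (x, y, z) positions; dt: frame interval.
    """
    vel = [
        tuple((b[i] - a[i]) / dt for i in range(3))
        for a, b in zip(joint_track, joint_track[1:])
    ]
    acc = [
        tuple((b[i] - a[i]) / dt for i in range(3))
        for a, b in zip(vel, vel[1:])
    ]
    return vel, acc


def mean_acceleration_magnitude(joint_track, dt=1.0):
    """Average acceleration norm; larger values suggest a jitterier track."""
    _, acc = finite_differences(joint_track, dt)
    if not acc:
        return 0.0
    return sum(math.sqrt(sum(c * c for c in a)) for a in acc) / len(acc)


normalized = normalize_to_fixed_box(
    [(1.0, 2.0, 3.0)], box_center=(1.0, 1.0, 2.0), box_size=(2.0, 2.0, 4.0)
)
smooth = mean_acceleration_magnitude([(0, 0, 0), (1, 0, 0), (2, 0, 0)])
jittery = mean_acceleration_magnitude([(0, 0, 0), (1, 0, 0), (0, 0, 0)])
```

A joint moving at constant velocity has zero acceleration, while one that jumps forward and back between frames does not, which is why penalizing such terms tends to favor smoother pose sequences.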
Authors
廖联军
钟重阳
张智恒
胡磊
张子豪
夏时洪
Liao Lianjun; Zhong Chongyang; Zhang Zhiheng; Hu Lei; Zhang Zihao; Xia Shihong (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; School of Information Science and Technology, North China University of Technology, Beijing 100144, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2022, No. 12, pp. 3608-3621 (14 pages)
Journal of Image and Graphics
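Funding
National Key Research and Development Program of China (2020YFF0304701)
National Natural Science Foundation of China (61772499)
Beijing Natural Science Foundation (L182052)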
Keywords
human motion
human pose estimation
human motion prediction
point cloud sequence
weakly-supervised learning