Abstract
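Objective: Three-dimensional (3D) human pose estimation is an active research topic in computer vision. Most current methods regress 3D joint coordinates directly from video or 2D keypoints and ignore the estimation of joint rotation angles. However, joint rotation angles are essential for some virtual reality and computer animation applications. This paper therefore proposes an attention fusion network that estimates 3D human coordinates and rotation angles simultaneously.
Method: A bone length network and a bone direction network first estimate the bone lengths and bone directions from a 2D human pose sequence, and preliminary 3D human coordinates are computed from these estimates. The preliminary 3D coordinates are then fed into a joint rotation angle estimation network to obtain the joint rotation angles, and a forward kinematics (FK) layer computes the 3D human coordinates corresponding to those angles. Because errors accumulate across the network modules, the 3D coordinates corresponding to the joint rotation angles are less accurate than the preliminary 3D coordinates, although the 3D coordinates output by the FK layer have a more stable skeletal structure. To combine the strengths of the two coordinate sequences, an attention fusion module finally fuses the preliminary 3D coordinates and the 3D coordinates corresponding to the joint rotation angles into the final 3D joint coordinates. This stepwise pose estimation algorithm makes it possible to constrain the intermediate states of the estimation, and the attention fusion mechanism combines high accuracy with skeletal stability, which improves the accuracy of the final result. In addition, a dedicated root joint processing module is designed to output more accurate root joint coordinates, further improving the accuracy and smoothness of the 3D human coordinates.
Result: Experiments on the Human3.6M dataset compare the mean per joint position error (MPJPE) with that of competing methods. The results show that, among works that can compute both joint coordinates and rotation angles, the proposed method achieves the best accuracy.
Conclusion: The proposed method estimates human joint coordinates and joint rotation angles simultaneously from video, and the resulting joint coordinates are more accurate than those of existing methods.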
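The step that turns bone lengths and bone directions into preliminary 3D coordinates can be illustrated with a minimal sketch. The abstract does not give the paper's skeleton definition or implementation, so the five-joint `PARENTS` table, the function names, and the convention that lengths and directions are indexed by child joint are all assumptions made for illustration only.

```python
import math

# Hypothetical toy skeleton (assumption, not the paper's): parent index per joint,
# -1 marks the root. Parents must appear before their children.
PARENTS = [-1, 0, 1, 0, 3]


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def compose_joints(root, bone_lengths, bone_dirs, parents=PARENTS):
    """Place each joint at parent position + bone length * unit bone direction.

    bone_lengths[j] and bone_dirs[j] describe the bone ending at joint j;
    the entries for the root joint are ignored.
    """
    joints = [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:
            joints[j] = tuple(float(c) for c in root)
            continue
        d = normalize(bone_dirs[j])
        joints[j] = tuple(joints[p][k] + bone_lengths[j] * d[k] for k in range(3))
    return joints


if __name__ == "__main__":
    lengths = [None, 2.0, 1.0, 1.0, 1.0]
    dirs = [None, (0, 0, 3), (1, 0, 0), (0, -2, 0), (0, -1, 0)]
    for idx, joint in enumerate(compose_joints((0.0, 0.0, 0.0), lengths, dirs)):
        print(idx, joint)
```

Walking the tree from the root outward means that the bone lengths are preserved by construction, which is one way to see why a decomposition into lengths and directions lets intermediate estimates be constrained.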
Objective: Three-dimensional (3D) human pose estimation has long been a research hotspot in computer vision. Most current methods directly regress 3D joint coordinates from videos or two-dimensional (2D) coordinate points and ignore the estimation of joint rotation angles. However, joint rotation angles are crucial for applications such as virtual reality and computer animation. To address this issue, we propose an attention fusion network that estimates 3D human coordinates and rotation angles. Furthermore, many existing methods for pose estimation from video or motion sequences lack a dedicated network for handling the root joint separately. This limitation reduces overall coordinate accuracy, especially when the subject moves extensively within the scene, and leads to drift and jitter. To tackle this problem, we also introduce a root joint processing approach that yields smoother and more stable root joint motion in the generated poses.
Method: Our attention fusion network follows a stepwise approach. First, we use a well-established 2D pose estimation algorithm to estimate the 2D motion sequence from video or image sequences. Then, we employ a skeleton length network and a skeleton direction network to estimate the bone lengths and bone directions of the human body from the 2D motion sequence. From these estimates, we calculate the initial 3D human coordinates. Next, we input the initial 3D coordinates into a joint rotation angle estimation network to obtain the joint rotation angles, and we apply forward kinematics to compute the 3D human coordinates corresponding to those angles. Because of network errors, however, the 3D coordinates corresponding to the joint rotation angles are slightly less precise than the initial 3D coordinates. As a final step, we therefore use an attention fusion module to integrate the initial 3D coordinates and the 3D coordinates corresponding to the joint rotation angles into the final 3D joint coordinates. This stepwise estimation allows the intermediate states of the estimation to be constrained. Moreover, the attention fusion mechanism mitigates the accuracy loss caused by errors in the joint rotation angle network, which improves the precision of the final results.
Result: We select several representative methods and conduct experiments on the Human3.6M dataset to compare their mean per joint position error (MPJPE). Human3.6M is one of the largest publicly available datasets for human pose estimation. It consists of seven subjects, each performing 15 actions captured by four cameras. Each action is annotated with 2D and 3D poses and with camera intrinsic and extrinsic parameters. The actions in the dataset include walking, jumping, and fist-clenching, covering a wide range of daily human activities. Experimental results demonstrate that our method is highly competitive. Its average MPJPE across all actions is 45.0 mm; it achieves the best MPJPE on some actions and the second-best on most of the others. The top-ranked method cannot estimate joint rotation angles alongside 3D joint coordinates, whereas our method can. The model is trained as follows. We minimize the loss function with the Adam optimizer. The batch size is 64, the motion sequence length is 80, the learning rate is 0.001, and we train for 50 epochs. To prevent overfitting, we add dropout layers with a rate of 0.25 to each module.
Conclusion: To address the rotation ambiguity of traditional human pose estimation methods that estimate only 3D joint coordinates, we propose an attention fusion network for estimating 3D human coordinates and rotation angles. The method decomposes the 3D coordinates into skeleton lengths, skeleton directions, and joint rotation angles. First, we calculate the initial 3D joint coordinate sequence from the skeleton lengths and directions. Then, we input the 3D and 2D coordinates into the joint rotation module to compute the joint rotation angles corresponding to the joint coordinates. Because of factors such as network errors, the precision of the 3D joint coordinates may decrease during this process. We therefore employ an attention fusion network to mitigate these adverse effects and obtain more accurate 3D coordinates. Comparative experiments demonstrate that our method achieves competitive joint coordinate accuracy and also estimates the corresponding joint rotation angles together with the 3D joint coordinates from video.
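For readers unfamiliar with the metric, MPJPE is the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints. The sketch below shows this for a single pose; the function name and the plain-tuple data layout are assumptions, and the abstract does not state the evaluation protocol (for example, whether poses are root-aligned before the error is computed), so no alignment step is included.

```python
import math


def mpjpe(pred, target):
    """Mean per joint position error for one pose.

    pred and target are equal-length sequences of (x, y, z) joint positions
    in the same units (e.g. millimetres). No root alignment is applied.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    total = sum(math.dist(p, t) for p, t in zip(pred, target))
    return total / len(pred)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
    print(mpjpe(predicted, ground_truth))
```

A figure reported for a whole dataset would additionally average this per-pose value over all frames and, typically, over actions.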
Authors
薛峰
边福利
李书杰
Xue Feng; Bian Fuli; Li Shujie (School of Software, Hefei University of Technology, Hefei 230009, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals
2024, No. 10, pp. 3116-3129 (14 pages)
Journal of Image and Graphics
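Funding
National Natural Science Foundation of China (62272143)
University Synergy Innovation Program of Anhui Province (GXXT-2022-054)
Seventh Special Support Plan for Innovation and Entrepreneurship Talents of Anhui Province (2021-27)
Major Science and Technology Project of Anhui Province (202203a05020025)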
Keywords
human pose estimation
joint coordinates
joint rotation angle
attention fusion
stepwise estimation