Attention fusion network for estimation of 3D joint coordinates and rotation of human pose
Abstract

Objective: Three-dimensional human pose estimation is a research hotspot in computer vision. Most current methods directly regress 3D joint coordinates from videos or 2D coordinate points, ignoring the estimation of joint rotation angles. However, joint rotation angles are crucial for applications such as virtual reality and computer animation. To address this issue, we propose an attention fusion network that simultaneously estimates 3D human joint coordinates and rotation angles. Furthermore, many existing video- or sequence-based pose estimation methods lack a dedicated network for handling the root joint separately. This reduces overall coordinate accuracy, especially when the subject moves extensively within the scene, leading to drift and jitter. We therefore also introduce a root joint processing module that yields smoother and more stable root-joint motion in the generated poses.

Method: The proposed network follows a stepwise approach. First, a well-established 2D pose estimation algorithm extracts the 2D motion sequence from the video or image sequence. A skeleton length network and a skeleton direction network then estimate the bone lengths and bone directions of the human body from this 2D sequence, from which the initial 3D joint coordinates are computed. Next, the initial 3D coordinates are fed into a joint rotation angle estimation network to obtain the joint rotation angles, and a forward kinematics (FK) layer computes the 3D coordinates corresponding to those rotation angles. Owing to accumulated error across network modules, the FK-derived coordinates are slightly less accurate than the initial 3D coordinates, but they exhibit a more stable skeletal structure. To combine the advantages of both coordinate sequences, an attention fusion module integrates the initial 3D coordinates and the FK-derived coordinates into the final 3D joint coordinates. This stepwise formulation allows constraints to be imposed on intermediate states of the estimation, and the attention fusion mechanism combines high accuracy with skeletal stability, improving the precision of the final result.

Result: We compare several representative methods on the Human3.6M dataset in terms of the mean per joint position error (MPJPE) metric. Human3.6M is one of the largest publicly available datasets in human pose estimation: it contains seven subjects, each performing 15 actions captured by four cameras, with 2D and 3D pose annotations and camera intrinsic and extrinsic parameters; the actions include walking, jumping, and fist-clenching, covering a wide range of daily activities. Our method achieves an average MPJPE of 45.0 mm across all actions, obtaining the best average MPJPE on some actions and the second-best on most of the others; among methods that can estimate joint rotation angles together with joint coordinates, it achieves the best accuracy. The method that places first overall cannot estimate joint rotation angles while estimating 3D joint coordinates, which is precisely the strength of our approach. Training details: we use the Adam optimizer for stochastic gradient descent with a batch size of 64, a motion sequence length of 80, a learning rate of 0.001, and 50 epochs; to prevent overfitting, dropout layers with rate 0.25 are added in each module.

Conclusion: To address the rotation ambiguity of traditional methods that estimate only 3D joint coordinates, we propose an attention fusion network for estimating 3D human coordinates and rotation angles. The method decomposes the 3D coordinates into skeleton lengths, skeleton directions, and joint rotation angles. The initial 3D joint coordinate sequence is computed from the estimated lengths and directions; the 3D and 2D coordinates are then fed into the joint rotation module to compute the rotation angles corresponding to the joint coordinates. Because network error can reduce coordinate precision in this process, an attention fusion network mitigates these adverse effects and yields more accurate 3D coordinates. Comparative experiments show that the proposed method not only achieves competitive joint coordinate accuracy but also estimates the corresponding joint rotation angles simultaneously with the 3D joint coordinates from video.
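The FK layer described in the Method section maps a root position, bone offsets (encoding length and direction), and per-joint local rotations to global 3D joint coordinates. The sketch below shows textbook forward kinematics over a kinematic tree; the paper's actual layer, skeleton definition, and rotation parameterization may differ, and all names here are illustrative.

```python
import numpy as np

def forward_kinematics(parents, offsets, rotations, root_pos):
    """Compute global 3D joint positions from per-joint local rotations.

    parents[j]   : index of joint j's parent (-1 for the root)
    offsets[j]   : rest-pose offset of joint j from its parent, shape (3,)
                   (bone length times bone direction)
    rotations[j] : local 3x3 rotation matrix of joint j
    root_pos     : global position of the root joint, shape (3,)
    """
    n = len(parents)
    global_rot = [None] * n
    pos = np.zeros((n, 3))
    for j in range(n):  # assumes parents are listed before their children
        if parents[j] == -1:
            global_rot[j] = rotations[j]
            pos[j] = root_pos
        else:
            p = parents[j]
            # accumulate rotation down the chain, then place the joint
            global_rot[j] = global_rot[p] @ rotations[j]
            pos[j] = pos[p] + global_rot[p] @ offsets[j]
    return pos
```

Because every position is a differentiable function of the rotations, such a layer lets coordinate-space losses supervise the rotation network end to end.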
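The attention fusion module combines the accurate initial coordinates with the structurally stable FK-derived ones. One common form of such fusion is per-joint softmax gating, sketched below; the scores `w_init` and `w_fk` would be produced by a learned network in practice, and the paper's exact gating mechanism may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(coords_init, coords_fk, w_init, w_fk):
    """Fuse two 3D coordinate estimates with per-joint attention weights.

    coords_init, coords_fk : (J, 3) candidate joint coordinates
    w_init, w_fk           : (J,) unnormalized scores for each stream
                             (learned in the paper; fixed here for illustration)
    """
    scores = np.stack([w_init, w_fk], axis=-1)  # (J, 2)
    alpha = softmax(scores, axis=-1)            # per-joint weights, sum to 1
    return alpha[..., 0:1] * coords_init + alpha[..., 1:2] * coords_fk
```

With equal scores the module averages the two streams; as one score dominates, the output converges to that stream's coordinates, so the network can pick per joint whichever estimate is more reliable.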
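MPJPE, the metric used in the Result section, is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints (and frames). A minimal sketch:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error, in the same unit as the inputs.

    pred, gt : arrays of shape (..., J, 3), e.g. (frames, joints, 3)
    """
    # per-joint Euclidean distance, then mean over all joints and frames
    return np.linalg.norm(pred - gt, axis=-1).mean()
```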
Authors: Xue Feng; Bian Fuli; Li Shujie (School of Software, Hefei University of Technology, Hefei 230009, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China)
Source: Journal of Image and Graphics (《中国图象图形学报》, CSCD, Peking University Core), 2024, No. 10, pp. 3116-3129 (14 pages)
Funding: National Natural Science Foundation of China (62272143); University Collaborative Innovation Project of Anhui Province (GXXT-2022-054); the 7th Special Support Plan for Innovation and Entrepreneurship Talents of Anhui Province (2021-27); Major Science and Technology Project of Anhui Province (202203a05020025)
Keywords: human pose estimation; joint coordinates; joint rotation angle; attention fusion; stepwise estimation