Human Pose Estimation Method Based on Cross Attention Transformer
Abstract  Most existing deep convolutional network methods for human pose estimation stack Transformer encoders without fully considering low-resolution global semantic information, resulting in difficult model learning and high inference cost. A multiscale representation learning method based on a cross-attention Transformer is therefore proposed. First, a deep convolutional network is used to obtain feature maps at different resolutions. These feature maps are then converted into multiscale visual tokens, and the distribution of keypoints in token space is predicted to speed up model convergence. To make low-resolution global semantics more identifiable, a multiscale cross-attention module is proposed: through repeated interactions between feature tokens of different resolutions, together with a keypoint-shifting strategy applied to the keypoint tokens, it reduces both keypoint-token redundancy and the number of cross-fusion operations. Finally, a cross-attention fusion module extracts feature information at different scales from the feature tokens to form the keypoint tokens, which helps reduce the inaccuracy of upsampling fusion. Experimental results on multiple benchmark datasets show that the method effectively helps the Transformer encoder learn the correlations between keypoints and, compared with the state-of-the-art TokenPose, reduces computational cost by 11.8% without degrading performance.
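To make the described mechanism concrete, the following PyTorch-style sketch shows one way learnable keypoint tokens can query multiscale feature tokens through cross-attention, in the spirit of the abstract's cross-attention fusion module. It is a minimal illustration only: the class name, token dimension, head count, and 17-keypoint (COCO-style) layout are assumptions, not the authors' published implementation.

    # Minimal sketch (not the authors' code): keypoint tokens act as queries
    # that pool information from multiscale feature tokens via cross-attention.
    # All names and sizes below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class KeypointCrossAttention(nn.Module):
        def __init__(self, dim=192, num_heads=8, num_keypoints=17):
            super().__init__()
            # One learnable token per keypoint (17 for COCO-style skeletons).
            self.keypoint_tokens = nn.Parameter(torch.zeros(1, num_keypoints, dim))
            nn.init.trunc_normal_(self.keypoint_tokens, std=0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, feature_tokens):
            # feature_tokens: (B, N, dim) visual tokens flattened from one or
            # more CNN feature-map resolutions and concatenated along N.
            B = feature_tokens.size(0)
            q = self.norm_q(self.keypoint_tokens.expand(B, -1, -1))
            kv = self.norm_kv(feature_tokens)
            # Keypoint tokens attend to (i.e., pool from) the feature tokens.
            out, _ = self.attn(q, kv, kv)
            return out  # (B, num_keypoints, dim) fused keypoint representations

    # Usage: tokens flattened from a 64x48 feature map (N = 3072) at dim 192.
    tokens = torch.randn(2, 64 * 48, 192)
    kp = KeypointCrossAttention()(tokens)
    print(kp.shape)  # torch.Size([2, 17, 192])

Because the keypoint tokens serve as queries, the attention cost scales with the number of keypoints rather than with the full token grid, which is consistent with the abstract's claim of reduced cross-fusion cost.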
Authors  WANG Kuan, XUAN Shibin, HE Xuedong, LI Ziwei, LI Jiaxiang (College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China)
Source  Computer Engineering (计算机工程), 2023, No. 7, pp. 223-231 (9 pages); indexed in CAS, CSCD, and the Peking University Core Journals list
Funding  National Natural Science Foundation of China (61866003)
Keywords  global semantics; multi-scale cross attention; human pose estimation; representation learning; cross attention fusion; Transformer encoder