Human Pose Estimation Method Based on Cross Attention Transformer
Abstract  Most existing deep convolutional network methods for human pose estimation stack Transformer encoders without fully considering low-resolution global semantic information, resulting in difficult model learning and high inference cost. A multiscale representation learning method based on a cross-attention Transformer is therefore proposed. First, a deep convolutional network is used to obtain feature maps at different resolutions. These feature maps are then converted into multiscale visual tokens, and the distribution of keypoints in token space is predicted to speed up model convergence. To make low-resolution global semantics more identifiable, a multiscale cross-attention module is proposed: through repeated interactions between feature tokens of different resolutions, together with a keypoint-shifting strategy applied to the keypoint tokens, it reduces both keypoint-token redundancy and the number of cross-fusion operations. Finally, a cross-attention fusion module extracts feature information at different scales from the feature tokens to form the keypoint tokens, which helps reduce the inaccuracy of upsampling fusion. Experimental results on multiple benchmark datasets show that the method effectively helps the Transformer encoder learn the correlations between keypoints and, compared with the state-of-the-art TokenPose, reduces computational cost by 11.8% without degrading performance.
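To make the described mechanism concrete, the following PyTorch-style sketch shows one way learnable keypoint tokens can query multiscale feature tokens through cross-attention, in the spirit of the abstract's cross-attention fusion module. It is a minimal illustration only: the class name, token dimension, head count, and 17-keypoint (COCO-style) layout are assumptions, not the authors' published implementation.

    # Minimal sketch (not the authors' code): keypoint tokens act as queries
    # that pool information from multiscale feature tokens via cross-attention.
    # All names and sizes below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class KeypointCrossAttention(nn.Module):
        def __init__(self, dim=192, num_heads=8, num_keypoints=17):
            super().__init__()
            # One learnable token per keypoint (17 for COCO-style skeletons).
            self.keypoint_tokens = nn.Parameter(torch.zeros(1, num_keypoints, dim))
            nn.init.trunc_normal_(self.keypoint_tokens, std=0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, feature_tokens):
            # feature_tokens: (B, N, dim) visual tokens flattened from one or
            # more CNN feature-map resolutions and concatenated along N.
            B = feature_tokens.size(0)
            q = self.norm_q(self.keypoint_tokens.expand(B, -1, -1))
            kv = self.norm_kv(feature_tokens)
            # Keypoint tokens attend to (i.e., pool from) the feature tokens.
            out, _ = self.attn(q, kv, kv)
            return out  # (B, num_keypoints, dim) fused keypoint representations

    # Usage: tokens flattened from a 64x48 feature map (N = 3072) at dim 192.
    tokens = torch.randn(2, 64 * 48, 192)
    kp = KeypointCrossAttention()(tokens)
    print(kp.shape)  # torch.Size([2, 17, 192])

Because the keypoint tokens serve as queries, the attention cost scales with the number of keypoints rather than with the full token grid, which is consistent with the abstract's claim of reduced cross-fusion cost.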
Authors  WANG Kuan, XUAN Shibin, HE Xuedong, LI Ziwei, LI Jiaxiang (College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China)
Source  Computer Engineering (计算机工程), 2023, No. 7, pp. 223-231 (9 pages); indexed in CAS, CSCD, and the Peking University Core Journals list
Funding  National Natural Science Foundation of China (61866003)
Keywords  global semantics; multi-scale cross attention; human pose estimation; representation learning; cross attention fusion; Transformer encoder