Abstract
Human action recognition provides a foundation for applications in healthcare, security, and entertainment, and has become an active research topic. To address the high computational complexity and large parameter count of the Vision Transformer (ViT), this paper proposes the MAPFormer framework, which exploits two properties of pooling: its cost is linear in sequence length, and it has no learnable parameters. MAPFormer replaces ViT's multi-head attention module with a parallel pooling module and uses depthwise separable convolutions to strengthen local features while further reducing the parameter count. The method is then applied to human action recognition to improve recognition accuracy. On the Mini-ImageNet and MS COCO datasets, the model reaches accuracies of 88.3% and 89.1%, respectively; compared with ViT, accuracy improves by 4.3% and 2.1%, and the parameter count falls by 65.2 M and 58.3 M.
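The parameter and complexity argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the design of the parallel pooling module, so the function names, the embedding width of 768 (the ViT-Base hidden size), the 3x3 kernel, and the single average-pooling mixer below are all assumptions chosen for illustration.

```python
# Illustrative only: the layer shapes and the pooling mixer are assumptions,
# not details taken from the MAPFormer paper.

def attention_params(dim: int) -> int:
    """Learnable parameters of multi-head self-attention:
    Q, K, V and output projections, each a dim x dim weight plus a bias."""
    return 4 * (dim * dim + dim)


def pooling_params() -> int:
    """Average or max pooling has no learnable parameters."""
    return 0


def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """A k x k convolution: one k x k x c_in filter per output channel, plus bias."""
    return k * k * c_in * c_out + c_out


def depthwise_separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """A depthwise k x k convolution (one filter per input channel)
    followed by a pointwise 1 x 1 convolution, both with bias."""
    depthwise = k * k * c_in + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise


def avg_pool_tokens(tokens, window=3):
    """Hypothetical token mixer: stride-1 average pooling along the sequence.
    Each token is visited once with a fixed-size window, so the cost grows
    linearly with the number of tokens, unlike pairwise attention."""
    half = window // 2
    n = len(tokens)
    pooled = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbours = tokens[lo:hi]  # edge windows are truncated
        pooled.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return pooled


if __name__ == "__main__":
    dim = 768  # assumed embedding width
    print("attention:", attention_params(dim))
    print("pooling:", pooling_params())
    print("3x3 conv:", standard_conv_params(dim, dim, 3))
    print("3x3 depthwise separable conv:", depthwise_separable_conv_params(dim, dim, 3))
    print(avg_pool_tokens([[1.0], [2.0], [3.0], [4.0]]))
```

Under these assumptions, a pooling mixer removes the attention block's projection weights entirely, and a depthwise separable convolution uses roughly one ninth of the parameters of a standard 3x3 convolution of the same width. The figures reported in the abstract (65.2 M and 58.3 M) refer to the full models and cannot be derived from this sketch.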
Authors
LU Jingfang (陆静芳)
ZHI Min (智敏)
College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
Source
Journal of Inner Mongolia Normal University (Natural Science Edition), Chinese-language edition (《内蒙古师范大学学报(自然科学汉文版)》)
CAS
2024, No. 1, pp. 44-52 (9 pages)
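Funding
Natural Science Foundation of Inner Mongolia Autonomous Region, project "Cross-Age Sheep Face Recognition Based on Orthogonal Video Transformer" (2023MS06009)
Fundamental Research Funds of Inner Mongolia Normal University, project "Research on Human Action Recognition Based on WSwinTransformer" (2022JBXC018)
Graduate Research Innovation Fund of Inner Mongolia Normal University, project "Research on Human Action Recognition Based on WSwinTransformer" (CXJJS22138)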