Abstract
Traditional gesture recognition methods do not comprehensively consider the global spatial, local spatial, temporal, and other feature information of hand gestures, so the extracted features usually struggle to fully represent the differences between gestures. To address this problem, a network structure combining a convolutional neural network (CNN) and a Transformer network was proposed. First, the lightweight MobileNet V3 convolutional neural network is used to extract spatial feature information from each frame of the input video sequence. The output is then passed through patch embedding, a temporal embedding is added, and the result is fed into the Transformer model, which uses the attention mechanism to extract the global attention features and temporal features of the gestures. Experiments were conducted on two public datasets, DHG-14/28 and VIVA. Compared with classical methods, the average recognition accuracy on DHG-14, DHG-28 and VIVA improved by 2.38%, 1.87% and 3.74%, respectively. The experimental results show that the proposed method can accurately extract the features of dynamic gesture sequences and represent gesture categories.
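The pipeline the abstract describes (per-frame CNN features → patch embedding → added temporal embedding → Transformer attention → classification) can be sketched with NumPy, treating the MobileNet V3 backbone as a black box that already yields one feature vector per frame. This is a minimal single-head, single-layer illustration, not the paper's implementation; all dimensions (the 576-dim frame feature, 64-dim model width, 14 classes matching DHG-14) and weight initializations here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gesture_pipeline(frames, d_model=64, n_classes=14):
    """Sketch of the CNN+Transformer gesture pipeline on one video.

    frames: (T, F) array of per-frame features, standing in for the
    output of the lightweight MobileNet V3 backbone.
    """
    T, F = frames.shape
    # Frame ("patch") embedding: linear projection to the model dimension.
    W_embed = rng.standard_normal((F, d_model)) * 0.02
    x = frames @ W_embed                        # (T, d_model)
    # Temporal embedding added so the order-agnostic attention sees frame order.
    temporal_embed = rng.standard_normal((T, d_model)) * 0.02
    x = x + temporal_embed
    # Single-head self-attention: every frame attends to every other frame,
    # capturing global relations across the whole sequence.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))  # (T, T) attention weights
    x = attn @ v                                # (T, d_model)
    # Mean-pool over time and classify into gesture categories.
    W_cls = rng.standard_normal((d_model, n_classes)) * 0.02
    logits = x.mean(axis=0) @ W_cls             # (n_classes,)
    return logits

# Toy input: 16 frames, each already reduced to a 576-dim CNN feature vector.
logits = gesture_pipeline(rng.standard_normal((16, 576)))
print(logits.shape)
```

In the paper's actual model the attention block would be a full multi-layer Transformer encoder with learned (trained) weights; the sketch only shows how the embeddings and attention fit together.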
Authors
WANG Fengping; ZHANG Yun (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China; Key Laboratory of Applications of Computer Technology of Yunnan Province, Kunming 650504, China)
Source
Journal of Shaanxi University of Technology (Natural Science Edition)
2023, No. 4, pp. 35-43 (9 pages)
Funding
National Natural Science Foundation of China (61262043)
Yunnan Provincial Science and Technology Plan Project (2011FZ029)
Open Fund Project of the Yunnan Provincial Key Laboratory (2020106)