Abstract
Lip reading is crucial in silent environments or environments with severe noise interference, and for people with hearing impairment. To address word-level Chinese lip reading, the SinoLipReadingNet model is proposed. Its front end combines Conv3D and ResNet34 to extract spatiotemporal features, and its back end uses either a Conv1D structure or a Bi-LSTM structure for classification and prediction. Self-Attention and CTCLoss are also introduced to improve the Bi-LSTM back end. Finally, the SinoLipReadingNet model is tested on the XWBank lip reading dataset. The results show that its recognition accuracy is significantly better than that of the D3D model from the Chinese Academy of Sciences, and that multi-model fusion reaches a prediction accuracy of 77.64% with an average character error rate (CER) of 21.68%.
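The abstract reports an average CER but does not say how it is computed. As a minimal sketch, assuming the common definition (character-level edit distance divided by the reference length) and a per-sample mean, the metric could be computed as follows. The function names `edit_distance`, `cer`, and `average_cer` are illustrative, not taken from the paper.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one prediction against its reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def average_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean of per-sample CER values (assumed averaging scheme)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (reference, hypothesis) pair is required")
    return sum(cer(r, h) for r, h in pairs) / len(pairs)
```

For example, `cer("银行卡", "银行")` counts one deleted character against a three-character reference. A corpus-level variant (total edits divided by total reference characters) is an equally plausible reading of "average CER" and would give slightly different numbers.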
Authors
陈红顺
陈观明
Chen Hongshun; Chen Guanming (School of Information Technology, Beijing Normal University (Zhuhai), Zhuhai 519087, China; Zhuhai Orbita Aerospace Science & Technology Co., Ltd., Zhuhai 519080, China)
Source
《电子技术应用》
2022, No. 12, pp. 54-58 (5 pages)
Application of Electronic Technique