Abstract
To help hearing people communicate with people who have hearing or speech impairments, and to support a barrier-free society, this work builds an end-to-end audio-visual speech recognition system based on multi-modal fusion that performs Chinese lip-reading translation. Experimental results show that applying the proposed end-to-end audio-visual speech recognition architecture to the lip-reading model achieves a character error rate of 8.0%. Compared with previous lip-reading models, it performs well at fusing image features and audio features.
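The abstract reports its result as a character error rate (CER) but does not say how it was computed. As a rough illustration only, the sketch below uses the common definition: the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` and the sample sentences are hypothetical and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Assumed example: one wrong character in a six-character reference.
    reference = "今天天气很好"
    hypothesis = "今天天汽很好"
    print(f"CER = {character_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a CER of 8.0% corresponds to roughly eight character edits per hundred reference characters. The evaluation in the paper itself may differ in details such as text normalization.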
Authors
陈焯辉
林绰雅
刘奕显
王茗琛
梁思敏
陈灵
Chen Zhuohui; Lin Chuoya; Liu Yixian; Wang Mingchen; Liang Simin; Chen Ling (Macao University of Science and Technology, Macao, China; Beijing Institute of Technology, Zhuhai, Zhuhai, China)
Source
《科学技术创新》
2023, No. 10, pp. 85-88 (4 pages)
Scientific and Technological Innovation
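Funding
2022 Guangdong Province College Students' Innovation and Entrepreneurship Training Program: Barrier-Free Communication System for Hearing-Impaired People Based on Chinese Lip-Reading Translation (S202213675010).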
Keywords
end-to-end audio-visual speech recognition architecture
multi-modal fusion
lip reading