Abstract
To eliminate the need for complex audio segmentation and forced alignment, and to exploit the visual information carried by the speaker's articulators in noisy environments, this paper proposes an end-to-end multimodal speech recognition algorithm that fuses lip features. The speaker's video is first decomposed into an image sequence, from which a regression-tree-based face alignment algorithm extracts features of the visually salient articulation region (the lips); these visual features are then temporally aligned and fused with the speaker's acoustic features to form a new feature representation. The fused features are processed by an end-to-end deep bidirectional long short-term memory network with connectionist temporal classification (DeepBiLstmCtc), which supports variable-length input and outputs the corresponding phoneme sequence. Experimental results show that the algorithm effectively recognizes phoneme sequences from audiovisual input and also yields a measurable improvement in recognition rate under noisy conditions.
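Two steps of the described pipeline lend themselves to a compact illustration: aligning lower-frame-rate visual features with acoustic features before fusion, and the CTC decoding step that turns per-frame network outputs into a phoneme sequence. The sketch below is illustrative only: the frame rates, feature dimensions, and function names are assumptions, not details taken from the paper, and the DeepBiLstmCtc network itself is omitted.

```python
from itertools import groupby

def fuse_features(acoustic, visual, ratio=4):
    """Align visual frames (e.g. 25 fps video) to acoustic frames
    (e.g. 100 fps, hence an assumed ratio of 4) by repeating each
    visual frame, then concatenate the per-frame feature vectors.
    Rates and dimensions here are hypothetical."""
    fused = []
    for t, a in enumerate(acoustic):
        # Pick the visual frame covering acoustic frame t; clamp at the end.
        v = visual[min(t // ratio, len(visual) - 1)]
        fused.append(a + v)  # list concatenation = feature concatenation
    return fused

def ctc_greedy_decode(frame_labels, blank="-"):
    """Greedy CTC collapse: merge consecutive repeated labels,
    then drop the blank symbol, yielding the phoneme sequence."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [p for p in collapsed if p != blank]
```

For example, a per-frame best path `["-", "b", "b", "-", "a", "a", "-", "-", "t"]` collapses to the phoneme sequence `["b", "a", "t"]`, which is how a variable-length frame sequence maps to a shorter phoneme sequence without any forced alignment.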
Source
Computer Science and Application (《计算机科学与应用》), 2021, No. 5, pp. 1315–1324 (10 pages)