Abstract
Lipreading is a multimodal task that converts silent videos of a speaker's lips into text, aiming to understand what the speaker expresses in the absence of sound. Existing lipreading methods mostly adopt convolutional neural networks to extract visual features of the lips; these capture only short-range pixel relationships and therefore struggle to distinguish the lip shapes of similarly pronounced characters. To capture long-range relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the Vision Transformer (ViT) is proposed. By fusing ViT with the Gated Recurrent Unit (GRU), the model's ability to extract visual spatio-temporal features from lip videos is improved. Specifically, the self-attention module of ViT first extracts global spatial features from lip images; GRU then models the temporal dependencies across the frame sequence; finally, a cascaded attention-based sequence-to-sequence model predicts the Chinese pinyin and Chinese character sentences. Experiments on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate.
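The long-range pixel modeling credited to ViT in the abstract rests on scaled dot-product self-attention, where every patch embedding attends to every other one regardless of distance. The following is a minimal pure-Python sketch of that mechanism, not code from the paper; the function name, the toy vectors, and the omission of learned projection matrices and multiple heads are all simplifications for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence of patch vectors.

    Q, K, V: lists of n vectors of dimension d (here, per-patch embeddings
    of a lip image). Each output vector is a weighted mix of ALL value
    vectors, so distant patches can influence each other directly.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query patch to every key patch, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of value vectors using the attention weights.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# With identical keys the weights are uniform, so the output is the mean of V.
feats = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
# feats == [[3.0, 0.0]]
```

In the full model this operation runs inside each ViT block over the patch embeddings of every frame, before the GRU aggregates the resulting per-frame features over time.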
Authors
XUE Feng, HONG Zikun, LI Shujie, LI Yu, XIE Yincen
(School of Software, Hefei University of Technology, Hefei 230601; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601)
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
2022, Issue 12, pp. 1111-1121 (11 pages)
Funding
National Natural Science Foundation of China (No. 62272143)
Collaborative Innovation Project of Anhui Universities (No. GXXT-2022-054)
Major Science and Technology Project of Anhui Province (No. 202203a05020025)
Supported by the 7th Special Support Plan for Innovation and Entrepreneurship Talents of Anhui Province.