Abstract
1 Introduction

Lip reading involves converting an image sequence into the corresponding text sequence. Lip reading currently has significant applications in many fields, such as assisted speech recognition and helping the speech impaired. Lip reading belongs to fine-grained video analysis and requires both the local information and the overall spatial information of the sequence. Most existing approaches generally capture local spatial information with CNNs and temporal information with RNNs. In view of these general methods, we propose a fine-grained method based on self-attention and self-distillation. The overall model mainly consists of a CNN front-end, pixel-wise learning, temporal learning, and a decoder. Specifically, we apply the CNN front-end to capture shallow spatial features within the image sequence, and employ the Resformer module, which incorporates self-attention, to learn the global spatial correlation between pixels, namely, pixel-wise learning.
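To make the front-end and pixel-wise learning stages concrete, the following is a minimal PyTorch sketch of this part of the pipeline. It is only an illustration under stated assumptions: the exact Resformer design is not given here, so a standard `nn.TransformerEncoderLayer` stands in for it, and the input size (88x88 grayscale mouth-region crops), channel widths, and module names are hypothetical rather than the authors' implementation.

```python
# Sketch: CNN front-end + self-attention over pixels (pixel-wise learning).
# The TransformerEncoderLayer is a stand-in for the paper's Resformer module;
# all hyperparameters and names below are illustrative assumptions.
import torch
import torch.nn as nn


class PixelWiseLearning(nn.Module):
    """CNN front-end followed by self-attention over spatial positions."""

    def __init__(self, in_channels: int = 1, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Shallow CNN front-end: captures local spatial features per frame.
        self.frontend = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(d_model),
            nn.ReLU(inplace=True),
        )
        # Self-attention over the pixels (spatial positions) of each frame,
        # modeling global spatial correlation between pixels.
        self.attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        x = self.frontend(frames.view(b * t, c, h, w))   # (b*t, d, h', w')
        d = x.shape[1]
        x = x.flatten(2).transpose(1, 2)                 # (b*t, h'*w', d)
        x = self.attn(x)                                 # global pixel-wise attention
        # Pool over spatial positions to get one feature vector per frame;
        # temporal learning and the decoder would operate on this sequence.
        return x.mean(dim=1).view(b, t, d)               # (b, t, d)


if __name__ == "__main__":
    model = PixelWiseLearning()
    dummy = torch.randn(2, 16, 1, 88, 88)  # 16 grayscale mouth-region frames
    print(model(dummy).shape)              # torch.Size([2, 16, 256])
```

The per-frame feature sequence produced here would then be passed to the temporal learning stage and the decoder described in the rest of the model.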