In recent years, speech-driven 3D facial animation has been widely studied. Although previous work can generate coherent 3D facial animation from speech data, the scarcity of audio-visual data means that the generated animations lack realism and vividness and that lip movements are not sufficiently accurate. To improve the accuracy and vividness of lip movements, this paper proposes HBF Talk, an end-to-end neural network model. It uses the pre-trained HuBERT (Hidden-Unit BERT) model to extract and encode features from the speech data, introduces a Flash module to further encode the extracted speech features and obtain richer contextual representations, and finally decodes them with a biased cross-modal Transformer decoder. Quantitative and qualitative experiments comparing HBF Talk against existing baseline models show that it performs better, improving the accuracy and vividness of speech-driven lip movements.
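As a rough illustration of the audio front end described above, the sketch below uses the HuggingFace transformers implementation of HuBERT to extract frame-level speech features. The checkpoint name is a publicly available stand-in, and the Flash module and biased cross-modal decoder appear only as hypothetical placeholders, not the paper's released code.

```python
# Minimal sketch of the speech-encoding front end, assuming the HuggingFace
# "transformers" HuBERT implementation. Everything after the HuBERT features
# (Flash module, biased cross-modal decoder) is a hypothetical placeholder.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Publicly available HuBERT checkpoint used here as a stand-in encoder.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("speech.wav")                    # any speech clip
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech_feats = hubert(**inputs).last_hidden_state           # (1, T, 768) frame-level features

# Downstream stages named in the abstract (placeholders, not the paper's code):
# contextual_feats = flash_module(speech_feats)                   # richer context encoding
# face_vertices    = biased_crossmodal_decoder(contextual_feats)  # 3D facial animation
```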
Previous research on speech-driven facial animation has produced fairly realistic and accurate lip movements and facial expressions from audio signals. Traditional methods focus mainly on learning a deterministic mapping from speech to animation, while recent work has begun to explore the diversity of speech-driven 3D facial animation, using the generative diversity of diffusion models to capture the complex many-to-many relationship between audio and facial motion. The Self-Diffuser method proposed in this paper encodes the audio input with the pre-trained wav2vec 2.0 speech model and combines diffusion-based techniques with a Transformer to perform the generation task. This work not only overcomes the limitation of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also examines the trade-off between precise lip synchronization and generating facial expressions that are independent of speech. Comparison and analysis against current state-of-the-art methods show that Self-Diffuser produces more accurate lip movements in speech-driven facial animation, upper-face motion that more closely resembles real speaking expressions in regions only loosely correlated with speech, and, owing to the diffusion mechanism, substantially greater diversity in the generated 3D facial animation sequences.
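In the same hedged spirit, the sketch below shows the two ingredients this abstract names: wav2vec 2.0 audio encoding and a diffusion-style sampling loop over facial-motion frames. The denoising Transformer, noise schedule, and motion dimensionality are illustrative assumptions, not the Self-Diffuser implementation.

```python
# Illustrative sketch only: wav2vec 2.0 encoding plus a plain DDPM-style
# sampling loop; the denoiser and schedule are assumptions, not Self-Diffuser.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def encode_audio(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Frame-level speech features (1, T, 768) from a mono 16 kHz waveform."""
    inputs = processor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return wav2vec(**inputs).last_hidden_state

class MotionDenoiser(torch.nn.Module):
    """Hypothetical Transformer denoiser: predicts the noise added to a sequence
    of facial-motion frames, conditioned on the speech features."""
    def __init__(self, motion_dim=64, audio_dim=768, d_model=256):
        super().__init__()
        self.motion_in = torch.nn.Linear(motion_dim, d_model)
        self.audio_in = torch.nn.Linear(audio_dim, d_model)
        layer = torch.nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, num_layers=2)
        self.out = torch.nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        h = self.motion_in(noisy_motion) + t.float().view(-1, 1, 1) / 1000.0  # crude timestep signal
        return self.out(self.decoder(h, self.audio_in(audio_feats)))

@torch.no_grad()
def sample_motion(audio_feats, denoiser, steps=50, motion_dim=64):
    """Ancestral DDPM sampling: start from noise, denoise step by step."""
    num_frames = audio_feats.shape[1]
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, num_frames, motion_dim)                  # pure noise to start
    for i in reversed(range(steps)):
        eps = denoiser(x, audio_feats, torch.tensor([i]))
        x = (x - betas[i] / torch.sqrt(1.0 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)  # re-inject noise except at the last step
    return x
```

Running the sampler several times with the same audio features yields different but plausible motion sequences, which is the many-to-many behaviour the abstract attributes to the diffusion mechanism.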