Self-Diffuser: Research on the Technology of Speech-Driven Facial Expressions

Abstract: Previous research on speech-driven facial animation has produced fairly realistic and accurate lip movements and facial expressions from audio signals. Traditional methods focus mainly on learning a deterministic mapping from speech to animation. More recent work has begun to explore diversity in speech-driven 3D facial animation, using the generative diversity of diffusion models to capture the complex many-to-many relationship between audio and facial motion. The Self-Diffuser method proposed in this paper encodes the audio input with the pre-trained speech model wav2vec 2.0 and combines a diffusion-based technique with a Transformer to perform the generation task. The work overcomes the limitations of traditional regression models in generating realistic, accurate lip movements that remain intelligible to lip reading, and it also examines the trade-off between precise lip synchronization and generating facial expressions that are independent of speech. In comparisons with current state-of-the-art methods, Self-Diffuser produces more accurate lip movements; its upper-face motion, which is only loosely correlated with speech, is closer to real speaking expressions; and the diffusion mechanism it introduces substantially increases the diversity of the generated 3D facial animation sequences.

Authors: Zang Mengli (臧梦利), Wang Shaobo (王少波), Zhi Yu (智宇), Chen Ang (陈昂)

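Affiliations: College of Computer Science and Artificial Intelligence, Wenzhou University; Institute of Metaverse and Artificial Intelligence, Wenzhou University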

Source: Computer Science and Application (《计算机科学与应用》), 2024, No. 8, pp. 236-249 (14 pages)

Keywords: wav2vec 2.0; Transformer; diffusion mechanism; speech-driven facial animation

Classification number: TP3 [Automation and Computer Technology: Computer Science and Technology]
