
Real-time speech driven talking avatar (实时语音驱动的虚拟说话人)
Cited by: 2
Abstract  This paper presents a real-time speech driven talking avatar: as the speech signal arrives, the corresponding facial animation is generated synchronously, unlike most talking avatars whose speech-synchronized animation is produced offline. Such a real-time speech driven avatar has broad application potential in videophones, virtual conferencing, audio/video chat and other instant-messaging and entertainment scenarios. Since phonemes are the smallest distinguishable units of pronunciation, a phoneme recognizer performs real-time phoneme recognition on the input speech, and the phoneme recognition and output algorithm is refined to improve the synchronization between the speech and the mouth motion. Taking coarticulation into account, a dynamic viseme generation algorithm converts the recognized phonemes into a sequence of facial animation parameters (FAPs). The FAP sequence then drives a 3-D head model parameterized according to the MPEG-4 facial animation standard, producing the synchronized facial animation. Subjective MOS evaluation shows that the system achieves scores of 3.42 for synchronization and 3.50 for realism.
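The abstract only outlines the dynamic viseme generation step, so the sketch below illustrates the general idea rather than the paper's actual algorithm: recognized phonemes are mapped to viseme targets through a lookup table, and neighbouring targets are blended with exponential dominance weights (in the spirit of Cohen-Massaro style coarticulation modeling) to produce per-frame FAP values. The phoneme-to-viseme table, the FAP subset, the dominance function and all constants are illustrative assumptions, not the parameters used in the paper.

```python
# Minimal sketch: convert a recognized phoneme sequence into per-frame MPEG-4
# FAP vectors with simple coarticulation blending. The viseme table, dominance
# function and FAP subset are illustrative assumptions only.
import math
from dataclasses import dataclass

FPS = 25                       # animation frame rate
NUM_FAPS = 4                   # tiny illustrative FAP subset (e.g. jaw, lips)

# Hypothetical viseme targets: per viseme, a target value for each FAP.
VISEME_FAPS = {
    "sil": [0.0, 0.0, 0.0, 0.0],
    "A":   [0.9, 0.2, 0.1, 0.0],   # open jaw
    "O":   [0.6, 0.0, 0.8, 0.3],   # rounded lips
    "M":   [0.0, 0.9, 0.0, 0.0],   # closed lips
}

# Hypothetical many-to-one phoneme -> viseme mapping.
PHONEME_TO_VISEME = {"a": "A", "aa": "A", "o": "O", "u": "O",
                     "m": "M", "b": "M", "p": "M", "sil": "sil"}

@dataclass
class PhonemeSegment:
    phoneme: str
    start: float   # seconds
    end: float     # seconds

def dominance(t, center, spread=0.08):
    """Exponential dominance weight of a viseme centered at `center`."""
    return math.exp(-abs(t - center) / spread)

def phonemes_to_fap_frames(segments):
    """Blend neighbouring viseme targets into per-frame FAP vectors."""
    total = max(seg.end for seg in segments)
    frames, t = [], 0.0
    while t <= total:
        weights, blended = 0.0, [0.0] * NUM_FAPS
        for seg in segments:
            viseme = PHONEME_TO_VISEME.get(seg.phoneme, "sil")
            center = 0.5 * (seg.start + seg.end)
            w = dominance(t, center)
            weights += w
            for i, target in enumerate(VISEME_FAPS[viseme]):
                blended[i] += w * target
        frames.append([v / weights for v in blended])
        t += 1.0 / FPS
    return frames

if __name__ == "__main__":
    segs = [PhonemeSegment("m", 0.00, 0.10),
            PhonemeSegment("a", 0.10, 0.30),
            PhonemeSegment("sil", 0.30, 0.40)]
    for frame in phonemes_to_fap_frames(segs)[:5]:
        print(["%.2f" % v for v in frame])
```

In a full pipeline each blended frame would cover the complete MPEG-4 FAP set and be streamed to the head-model renderer as phoneme segments arrive from the real-time recognizer.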
Source  Journal of Tsinghua University (Science and Technology) (清华大学学报(自然科学版)), 2011, No. 9: 1180-1186 (7 pages). Indexed in EI, CAS, CSCD and the Peking University Core Journals list.
Funding  National Natural Science Foundation of China Young Scientists Fund (60802085); National Natural Science Foundation of China General Program (61175018); Shaanxi Provincial Science and Technology Plan Young Sci-Tech Star Project (2011KJXX29); Shaanxi Provincial Natural Science Basic Research Program (2011JM8009)
Keywords  visual speech synthesis; talking avatar; facial animation