
Real-time speech driven talking avatar (实时语音驱动的虚拟说话人)
Cited by: 2
Abstract  This paper presents a real-time speech driven talking avatar: as the speech signal arrives, the corresponding facial animation is generated synchronously, unlike most talking avatars whose speech-synchronized animation is produced offline. Such a real-time speech driven avatar has broad application potential in videophones, virtual conferencing, audio/video chat and other instant-messaging and entertainment scenarios. Since phonemes are the smallest distinguishable units of pronunciation, a phoneme recognizer performs real-time phoneme recognition on the input speech, and the phoneme recognition and output algorithm is refined to improve the synchronization between the speech and the mouth motion. Taking coarticulation into account, a dynamic viseme generation algorithm converts the recognized phonemes into a sequence of facial animation parameters (FAPs). The FAP sequence then drives a 3-D head model parameterized according to the MPEG-4 facial animation standard, producing the synchronized facial animation. Subjective MOS evaluation shows that the system achieves scores of 3.42 for synchronization and 3.50 for realism.
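The abstract only outlines the dynamic viseme generation step, so the sketch below illustrates the general idea rather than the paper's actual algorithm: recognized phonemes are mapped to viseme targets through a lookup table, and neighbouring targets are blended with exponential dominance weights (in the spirit of Cohen-Massaro style coarticulation modeling) to produce per-frame FAP values. The phoneme-to-viseme table, the FAP subset, the dominance function and all constants are illustrative assumptions, not the parameters used in the paper.

```python
# Minimal sketch: convert a recognized phoneme sequence into per-frame MPEG-4
# FAP vectors with simple coarticulation blending. The viseme table, dominance
# function and FAP subset are illustrative assumptions only.
import math
from dataclasses import dataclass

FPS = 25                       # animation frame rate
NUM_FAPS = 4                   # tiny illustrative FAP subset (e.g. jaw, lips)

# Hypothetical viseme targets: per viseme, a target value for each FAP.
VISEME_FAPS = {
    "sil": [0.0, 0.0, 0.0, 0.0],
    "A":   [0.9, 0.2, 0.1, 0.0],   # open jaw
    "O":   [0.6, 0.0, 0.8, 0.3],   # rounded lips
    "M":   [0.0, 0.9, 0.0, 0.0],   # closed lips
}

# Hypothetical many-to-one phoneme -> viseme mapping.
PHONEME_TO_VISEME = {"a": "A", "aa": "A", "o": "O", "u": "O",
                     "m": "M", "b": "M", "p": "M", "sil": "sil"}

@dataclass
class PhonemeSegment:
    phoneme: str
    start: float   # seconds
    end: float     # seconds

def dominance(t, center, spread=0.08):
    """Exponential dominance weight of a viseme centered at `center`."""
    return math.exp(-abs(t - center) / spread)

def phonemes_to_fap_frames(segments):
    """Blend neighbouring viseme targets into per-frame FAP vectors."""
    total = max(seg.end for seg in segments)
    frames, t = [], 0.0
    while t <= total:
        weights, blended = 0.0, [0.0] * NUM_FAPS
        for seg in segments:
            viseme = PHONEME_TO_VISEME.get(seg.phoneme, "sil")
            center = 0.5 * (seg.start + seg.end)
            w = dominance(t, center)
            weights += w
            for i, target in enumerate(VISEME_FAPS[viseme]):
                blended[i] += w * target
        frames.append([v / weights for v in blended])
        t += 1.0 / FPS
    return frames

if __name__ == "__main__":
    segs = [PhonemeSegment("m", 0.00, 0.10),
            PhonemeSegment("a", 0.10, 0.30),
            PhonemeSegment("sil", 0.30, 0.40)]
    for frame in phonemes_to_fap_frames(segs)[:5]:
        print(["%.2f" % v for v in frame])
```

In a full pipeline each blended frame would cover the complete MPEG-4 FAP set and be streamed to the head-model renderer as phoneme segments arrive from the real-time recognizer.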
Source  Journal of Tsinghua University (Science and Technology) (清华大学学报(自然科学版)), 2011, No. 9: 1180-1186 (7 pages). Indexed in EI, CAS, CSCD and the Peking University Core Journals list.
Funding  National Natural Science Foundation of China Young Scientists Fund (60802085); National Natural Science Foundation of China General Program (61175018); Shaanxi Provincial Science and Technology Plan Young Sci-Tech Star Project (2011KJXX29); Shaanxi Provincial Natural Science Basic Research Program (2011JM8009)
Keywords  visual speech synthesis; talking avatar; facial animation