
Multi-modal digital human modeling, synthesis, and driving: a survey
Abstract: A multimodal digital human refers to a digital avatar that can perform multimodal cognition and interaction and that thinks and behaves like a human being. Substantial progress has been made in related technologies owing to the cross-fertilization and vibrant development of fields such as computer vision and natural language processing. This article discusses three major themes in computer graphics and computer vision: multimodal head animation, multimodal body animation, and multimodal portrait creation, and introduces the methodologies and representative works in each area. Under the theme of multimodal head animation, this work presents research on speech- and expression-driven head models. Under the theme of multimodal body animation, the paper explores recurrent neural network (RNN)-, Transformer-, and denoising diffusion probabilistic model (DDPM)-based body animation. The discussion of multimodal portrait creation covers portrait creation guided by visual-linguistic similarity, portrait creation guided by multimodal denoising diffusion models, and three-dimensional (3D) multimodal generative models for digital portraits. Furthermore, this article provides an overview and classification of representative works in these directions, summarizes existing methods, and points out potential future research directions.

This article delves into key directions in the field of multimodal digital humans, covering multimodal head animation, multimodal body animation, and the construction of multimodal digital human representations. In the realm of multimodal head animation, we extensively explore two major tasks: expression- and speech-driven animation. For expression-driven head animation, both explicit and implicit parameterized models are examined, in which mesh surfaces and neural radiance fields (NeRF) are used to improve rendering quality. Explicit models employ 3D morphable and linear models but encounter challenges such as weak expressive capacity, nondifferentiable rendering, and difficulty in modeling personalized features. By contrast, implicit models, especially those based on NeRF, demonstrate superior expressive capacity and realism. In the domain of speech-driven head animation, we review 2D and 3D methods, with a particular focus on the advantages of NeRF in enhancing realism. 2D speech-driven head video generation uses techniques such as generative adversarial networks and image transfer but depends on 3D prior knowledge and structural characteristics. By contrast, methods built on NeRF, such as audio-driven NeRF for talking head synthesis (AD-NeRF) and semantic-aware implicit neural audio-driven video portrait generation (SSP-NeRF), achieve end-to-end training with differentiable NeRF, which substantially improves rendering realism, although slow training and inference remain challenges.

Multimodal body animation covers speech-driven body animation, music-driven dance, and text-driven body animation. We emphasize the importance of learning speech semantics and melody and discuss the applications of RNNs, Transformers, and denoising diffusion models in this field. The Transformer has gradually replaced the RNN as the mainstream model, gaining notable advantages in learning sequential signals through its attention mechanism. We also highlight body animation generation based on denoising diffusion models, such as free-form language-based motion synthesis and editing (FLAME), the motion diffusion model (MDM), and text-driven human motion generation with diffusion models (MotionDiffuse), as well as multimodal denoising networks conditioned on music and text.
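To make the diffusion-based motion generation concrete, the following is a minimal, illustrative sketch of DDPM-style ancestral sampling for a motion sequence, written in generic PyTorch. It is a simplified, assumption-laden example rather than code from FLAME, MDM, or MotionDiffuse: the linear noise schedule, the epsilon-prediction parameterization (MDM itself predicts the clean sequence rather than the noise), and the sequence length and feature dimension are all placeholder choices.

    # Hedged sketch: DDPM-style sampling of a human-motion sequence,
    # conditioned on a text or music embedding. All names are illustrative.
    import torch

    T = 1000                                       # number of diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative alpha products

    @torch.no_grad()
    def sample_motion(denoiser, cond, seq_len=60, dim=135):
        # denoiser(x_t, t, cond) -> predicted noise eps (placeholder network)
        x = torch.randn(1, seq_len, dim)           # x_T ~ N(0, I)
        for t in reversed(range(T)):
            eps = denoiser(x, torch.tensor([t]), cond)
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            mean = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise          # x_{t-1}
        return x                                   # x_0: per-frame pose features

Each reverse step inverts one forward noising step; the conditioning signal enters only through the denoiser, which is why the same sampler can serve speech-, music-, and text-driven body animation.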
In the realm of constructing multimodal digital human representations, the article emphasizes virtual avatar construction guided by visual-language similarity and by denoising diffusion models. In addition, the demand for large-scale, diverse datasets in digital human representation construction is addressed as a prerequisite for powerful and universal generative models.

In summary, three key aspects of multimodal digital humans are systematically explored: head animation, body animation, and digital human representation construction. Explicit head models, although simple, editable, and computationally efficient, lack expressive capacity and face rendering challenges, especially in modeling facial personalization and nonfacial regions. By contrast, implicit models, especially those using NeRF, demonstrate stronger modeling capabilities and more realistic rendering. In speech-driven animation, NeRF-based solutions for head animation overcome the limitations of 2D talking-head and 3D digital head animation and achieve more natural and realistic talking-head videos. Regarding body animation models, the Transformer has gradually replaced the RNN, whereas denoising diffusion models can potentially address the mapping challenges in multimodal body animation. Finally, digital human representation construction still faces challenges: guidance by visual-language similarity and by denoising diffusion models shows promising results, but directly constructing 3D multimodal virtual humans remains difficult owing to the lack of sufficient 3D virtual human datasets. This study comprehensively analyzes these issues and provides clear directions and challenges for future research.

In conclusion, future work should focus on several developments in multimodal digital humans. Key directions include improving the accuracy of 3D modeling and real-time rendering, integrating speech-driven animation with facial expression synthesis, constructing large and diverse datasets, exploring multimodal information fusion and cross-modal learning, and addressing ethical and social impacts. Implicit representation methods, such as neural volume rendering, are crucial for improved 3D modeling. Simultaneously, the construction of larger datasets poses a formidable challenge for the development of robust and universal generative models. Exploring multimodal information fusion and cross-modal learning allows models to learn from diverse data sources and to present a wide range of behaviors and expressions. Attention to ethical and social impacts, including digital identity and privacy, is also crucial. These research directions should serve to guide the field toward a comprehensive, realistic, and universal future, with a profound influence on interactions in virtual spaces.
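As a compact reference for the NeRF-based implicit representations that recur throughout this abstract (AD-NeRF, SSP-NeRF, and neural volume rendering in general), the standard volume rendering equation from the original NeRF formulation is reproduced below; the audio-driven variants additionally condition the density and color network on per-frame audio features.

    C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,\mathrm{d}t,
    \qquad
    T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,\mathrm{d}s\right)

Here \mathbf{r}(t) = \mathbf{o} + t\mathbf{d} is a camera ray with near and far bounds t_n and t_f, \sigma is the volume density, \mathbf{c} is the view-dependent color, and T(t) is the accumulated transmittance. Because the rendered color C(\mathbf{r}) is differentiable with respect to the network parameters, methods such as AD-NeRF can be trained end to end directly from video, which is the property the survey credits for their improved rendering realism.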
Authors: Gao Xuan, Liu Dongyu, Zhang Juyong (Key Laboratory of Computer Graphics and Perception Interaction in Anhui Province, University of Science and Technology of China, Hefei 230026, China)
Source: Journal of Image and Graphics (中国图象图形学报), 2024, No. 9, pp. 2494-2512 (19 pages). Indexed in CSCD and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (62122071, 62272433).
Keywords: virtual human modeling; multimodal character animation; multimodal generation and editing; neural rendering; generative models; neural implicit representation