A survey on multimodal information-guided 3D human motion generation

Abstract: Multimodal information-guided 3D digital human motion generation aims to synthesize human motion under specific input conditions from data such as text, audio, images, and video. The technology has significant application value and broad economic and social benefits in film, animation, game production, and the metaverse, and has become a research hotspot in computer graphics and computer vision in recent years. The task nevertheless faces many challenges, including the difficulty of representing and fusing cross-modal information, a lack of high-quality datasets, poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although researchers have proposed a variety of solutions, achieving efficient, high-quality 3D digital human motion generation that accounts for the characteristics of each modality remains an open problem. Taking the model architecture used for motion generation as the classification criterion, this paper divides the mainstream methods into those based on generative adversarial networks (GANs), autoencoders (AEs), variational autoencoders (VAEs), and diffusion models, and distills them into a general framework for digital human motion generation. It also introduces the parametric human models, datasets, and evaluation metrics commonly used in the field. For a number of representative works, comparative experiments on commonly used datasets evaluate their performance. Finally, drawing on the existing datasets, algorithms, and representative studies, the paper summarizes the problems and challenges in the field and discusses potential research directions: improving datasets, optimizing motion quality and diversity, fusing cross-modal information, and raising generation efficiency.

Three-dimensional (3D) digital human motion generation guided by multimodal information generates human motion under specific input conditions from data such as text, audio, images, and video. This technology has a wide spectrum of applications and extensive economic and social benefits in film, animation, game production, the metaverse, and related fields, and is one of the research hotspots in computer graphics and computer vision. However, the task faces major challenges, including the difficulty of representing and fusing multimodal information, a lack of high-quality datasets, poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although various solutions have been proposed to address these challenges, a mechanism for achieving efficient and high-quality 3D digital human motion generation based on the characteristics of distinct modal data remains an open problem.

This paper comprehensively reviews 3D digital human motion generation and elaborates on recent advances from the perspectives of parametrized 3D human models, human motion representation, motion generation techniques, motion analysis and editing, existing human motion datasets, and evaluation metrics. Parametrized human models facilitate digital human modeling and motion generation by providing parameters associated with body shapes and postures, and they serve as key pillars of current digital human research and applications. The survey begins with an introduction to widely used parametrized 3D human body models, including shape completion and animation of people (SCAPE), the skinned multi-person linear model (SMPL), SMPL-X, and SMPL-H, and compares them in detail in terms of model representations and the parameters used to control body shapes, poses, and facial expressions. Human motion representation is a core issue in digital human motion generation. This work highlights the musculoskeletal model and classic skinning algorithms, including linear blend skinning and dual quaternion skinning, and their application in physics-based and data-driven methods to control human movements.

We have also extensively studied existing approaches to multimodal information-guided human motion generation and categorized them into four major branches: generative adversarial network-, autoencoder-, variational autoencoder-, and diffusion model-based methods. Other works, such as generative motion matching, are also mentioned and compared with data-driven methods. The survey summarizes existing schemes of human motion generation from the perspectives of methods and model architectures and presents a unified framework for the generation of digital human motion. In this framework, a motion encoder extracts motion features from an original motion sequence, and these features are fused with the conditional features extracted by a conditional encoder into latent variables or mapped to the latent space. This conditioning enables generative adversarial networks, autoencoders, variational autoencoders, or diffusion models to generate human movements that satisfy the given condition through a motion decoder. In addition, this paper surveys current work on digital human motion analysis and editing, including motion clustering, motion prediction, motion in-betweening, and motion in-filling.

Data-driven human motion generation and evaluation require high-quality datasets. We collected publicly available human motion databases and classified them according to two criteria. From the perspective of data type, existing databases can be classified into motion capture and video reconstruction datasets. Motion capture datasets rely on devices such as motion capture systems, cameras, and inertial measurement units to obtain real human movement data (i.e., ground truth). Video reconstruction datasets, by contrast, reconstruct a 3D human body model by estimating body joints from motion videos and fitting them to a parametric human body model. From the perspective of task type, commonly used databases can be classified into text-, action-, and audio-motion datasets. New datasets are usually obtained by processing motion capture and video reconstruction datasets for specific tasks. A comprehensive overview of the evaluation metrics for 3D human motion generation is also provided, covering motion quality, motion diversity and multimodality, consistency between inputs and outputs, and inference efficiency. Apart from objective evaluation metrics, user studies employed to evaluate the quality of generated human motion are also discussed. To compare the performance of various digital human motion generation methods on public datasets, we selected a collection of the most representative works and carried out extensive experiments for comprehensive evaluation.

Finally, the well-addressed and underexplored issues in this field are summarized, and several potential research directions regarding datasets, the quality and diversity of generated motions, cross-modal information fusion, and generation efficiency are discussed. Specifically, existing datasets generally fail to meet expectations concerning motion diversity, the descriptions associated with motions, data distribution, and the length of motion sequences. Future work should consider developing a large-scale 3D human motion database to boost the efficacy and robustness of motion generation models. In addition, the quality of generated human motions, especially those with complex movement patterns, remains unsatisfactory. Integrating physical constraints and postprocessing into human motion generation frameworks is a promising way to tackle these issues. Furthermore, although human motion generation methods can generate various motion sequences from multimodal information, such as text, audio, music, actions, and keyframes, work on cross-modal human motion generation (e.g., generating a motion from a text description together with a piece of background music) is scarcely reported. Such a task is worth investigating, since it could unlock new opportunities in this area. In terms of the diversity of generated content, some researchers have explored harvesting rich, diverse, and stylized motions using variational autoencoders, diffusion models, and contrastive language-image pretraining (CLIP) neural networks. However, current studies mainly focus on the motion generation of a single human represented by an SMPL-like unclothed parameterized 3D model. The generation and interaction of multiple clothed humans have huge untapped application potential but have not received sufficient attention. A last, non-negligible issue is how to boost motion generation efficiency and achieve a good balance between quality and inference overhead. Possible solutions include lightweight parameterized human models, information-intensive training datasets, and improved or more advanced generative frameworks.
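The abstract lists linear blend skinning among the classic skinning algorithms the survey covers. As a rough illustration only, and not the survey's own formulation or code, the standard rule deforms each vertex as a weighted sum of that vertex transformed by every bone: v' = sum_i w_i * T_i * v. The sketch below assumes a toy two-bone arm; the function names (`rotation_z`, `transform_point`, `linear_blend_skinning`) and all numeric values are hypothetical choices made for this example.

```python
import math

def rotation_z(angle, pivot=(0.0, 0.0, 0.0)):
    """Hypothetical helper: 4x4 homogeneous transform that rotates by
    `angle` radians about the z-axis passing through `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py, _ = pivot
    # R (v - p) + p  ==  R v + (p - R p)
    return [
        [c, -s, 0.0, px - (c * px - s * py)],
        [s, c, 0.0, py - (s * px + c * py)],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform_point(matrix, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    h = (point[0], point[1], point[2], 1.0)
    return tuple(sum(matrix[r][k] * h[k] for k in range(4)) for r in range(3))

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Blend each vertex over all bones: v' = sum_i w_i * (T_i v).
    weights[j][i] is the influence of bone i on vertex j; each row
    is assumed to sum to 1."""
    skinned = []
    for vertex, vertex_weights in zip(vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for w, transform in zip(vertex_weights, bone_transforms):
            moved = transform_point(transform, vertex)
            for axis in range(3):
                blended[axis] += w * moved[axis]
        skinned.append(tuple(blended))
    return skinned

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

# Toy arm along the x-axis with a joint at x = 1 (assumed values).
vertices = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
# Bone 0 stays fixed; bone 1 bends 90 degrees about the joint.
bones = [IDENTITY, rotation_z(math.pi / 2, pivot=(1.0, 0.0, 0.0))]

skinned = linear_blend_skinning(vertices, weights, bones)
for before, after in zip(vertices, skinned):
    print(before, "->", tuple(round(x, 3) for x in after))
```

In this toy setup the middle vertex, shared equally by both bones, lands at about (1.25, 0.25, 0), closer to the joint than either rigid transform would place it. That shrinkage near bent joints is the well-known weakness of linear blending, and it is one reason dual quaternion skinning is usually discussed alongside it.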

Authors: Zhao Baoquan; Fu Yiyu; Su Zhuo; Wang Ruomei; Lyu Chenlei; Luo Xiaonan

Affiliations: School of Artificial Intelligence, Sun Yat-sen University, Zhuhai 519000, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China; School of Computer and Information Security, Guilin University of Electronic Science and Technology, Guilin 541004, China

Source: Journal of Image and Graphics (中国图象图形学报), indexed in CSCD and the Peking University core journal list, 2024, No. 9, pp. 2541-2565 (25 pages)

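Funding: National Key Research and Development Program of China (2022YFF0903103); Natural Science Foundation of Guangdong Province (2023A1515011639); Fundamental Research Funds for the Central Universities (23xkjc019, 24qnpy145).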

Keywords: 3D avatar motion generation; multimodal information; parametric human model; generative adversarial network (GAN); autoencoder (AE); variational autoencoder (VAE); diffusion model

Classification number: TP391 [Automation and computer technology: computer application technology]
