Abstract
Objective To address the problems of existing dynamic three-dimensional (3D) digital human model generation, such as fixed body shape and single, unchangeable motion, this paper proposes a method that fuses a variational auto-encoder (VAE) network, a contrastive language-image pretraining (CLIP) network, and a gate recurrent unit (GRU) network to generate 3D human models in motion. The method can generate a 3D human model whose body shape and actions match a textual description.

Method First, a VAE encoder network produces latent codes, and the CLIP network is used in a zero-shot manner to generate a human model whose body shape matches the text, which avoids implausible skinned multi-person linear (SMPL) parameters that would otherwise yield human models with abnormal body shapes. Second, the VAE and GRU networks generate variable-length 3D human pose sequences consistent with the text, which addresses the limitation of existing motion generation methods that can only produce pose sequences of a pre-specified length. Finally, the body shape features and motion features are combined to obtain a 3D human model in motion.

Result Human model generation experiments were conducted on the HumanML3D dataset and compared with three other methods. Relative to the best existing method, R-precision improves by 0.031, 0.034, and 0.028 at Top1, Top2, and Top3, respectively; the Fréchet inception distance (FID) improves by 0.094; and diversity improves by 0.065. Ablation experiments verify the effectiveness of the model and show that the proposed method improves human model generation.

Conclusion The proposed method can generate dynamic 3D human models from textual descriptions, and the body shape and motion of the generated models better match the input text.
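The zero-shot body-shape step described above (VAE latent codes decoded to SMPL shape parameters, with CLIP ranking candidates against the text) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `decode_shape` and `render_front_view` are hypothetical stand-ins for the paper's VAE shape decoder and SMPL renderer, and the candidate count and latent size are assumed values.

```python
# Minimal sketch (not the authors' implementation) of zero-shot body-shape selection:
# sample VAE latents, decode them to candidate SMPL shape parameters, render each
# candidate, and let CLIP rank the renders against the text description.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def select_body_shape(text, decode_shape, render_front_view,
                      n_candidates=64, latent_dim=32):
    """Return the SMPL shape whose rendering best matches `text` under CLIP.

    `decode_shape(z)` maps a latent to SMPL shape parameters (assumed given);
    `render_front_view(beta)` returns a PIL image of the posed body (assumed given).
    """
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize([text]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        best_score, best_beta = -1.0, None
        for _ in range(n_candidates):
            z = torch.randn(1, latent_dim, device=device)   # sample a VAE latent code
            beta = decode_shape(z)                           # -> SMPL shape parameters
            image = preprocess(render_front_view(beta)).unsqueeze(0).to(device)
            img_feat = clip_model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            score = (img_feat @ text_feat.T).item()          # cosine similarity
            if score > best_score:
                best_score, best_beta = score, beta
    return best_beta
```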
Objective Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when it is applied to generate virtual 3D scene models from natural language. For static 3D objects, methods have emerged that can generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in many scenarios, only two-dimensional (2D) human images or sequences of human poses can currently be generated from a given textual description; dynamic 3D human models cannot yet be generated from natural language in the same way. Moreover, existing methods suffer from problems such as fixed body shape and motion when generating dynamic digital human models. A method fusing a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gate recurrent unit (GRU), which can generate dynamic 3D human models corresponding to the body shapes and motions described by the text, is proposed to address these problems.

Method A method based on the VAE network is proposed in this paper to generate dynamic 3D human models that correspond to the body shape and action information described in the text. Notably, a variety of pose sequences with variable duration can be generated with the proposed method. First, the body shape information is obtained through the body shape generation module based on the VAE network and the CLIP model, and zero-shot matching is used to generate a skinned multi-person linear (SMPL) parametric human model that matches the textual description. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches textual descriptions against body shapes, and the 3D human model with the highest matching score is selected. Second, variable-length 3D human pose sequences that match the textual description are generated through the body action generation module based on the VAE and GRU networks. In particular, the VAE auto-encoder encodes the dynamic human poses, the action length sampling network predicts a motion duration that matches the textual description of the action, and the GRU and VAE networks encode the input text and generate diverse dynamic 3D human pose sequences through the decoder. Finally, a dynamic 3D human model corresponding to the body shape and action description is generated by fusing the body shape and action information obtained above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14 616 motions and 44 970 linguistic annotations. Some of the motions in the dataset are mirrored before training, and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment with a Tesla V100 GPU and 16 GB of GPU memory. The motion auto-encoder is trained for 300 epochs with the adaptive moment estimation (Adam) optimizer, a learning rate of 0.000 1, and a batch size of 128. The motion generator is trained for 320 epochs with the Adam optimizer, a learning rate of 0.000 2, and a batch size of 32.
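The action generation module described above (a GRU text encoder, an action-length sampling network, and a VAE decoder producing variable-length pose sequences) can be illustrated with a short PyTorch-style sketch. All module names, hidden sizes, and the 263-dimensional pose representation (the usual HumanML3D feature size) are assumptions for illustration and do not reproduce the paper's exact architecture or losses.

```python
# Hedged sketch of a text-conditioned, variable-length motion generator in the spirit of
# the VAE + GRU action module described above; sizes and names are illustrative only.
import torch
import torch.nn as nn

class TextGRUEncoder(nn.Module):
    """Encode a tokenized text description into the mean/log-variance of a motion latent."""
    def __init__(self, vocab_size=8000, embed_dim=300, hidden_dim=512, latent_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        _, h = self.gru(self.embed(tokens))          # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

class MotionGRUDecoder(nn.Module):
    """Decode a latent code into a pose sequence of a given (sampled) length."""
    def __init__(self, latent_dim=256, hidden_dim=512, pose_dim=263):
        super().__init__()
        self.init_h = nn.Linear(latent_dim, hidden_dim)
        self.gru_cell = nn.GRUCell(pose_dim, hidden_dim)
        self.to_pose = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, length):
        h = self.init_h(z)
        pose = torch.zeros(z.size(0), self.to_pose.out_features, device=z.device)
        poses = []
        for _ in range(length):                      # autoregressive roll-out
            h = self.gru_cell(pose, h)
            pose = self.to_pose(h)
            poses.append(pose)
        return torch.stack(poses, dim=1)             # (batch, length, pose_dim)

# Usage: reparameterize the text latent and decode a sequence whose length would come
# from the action-length sampling network (stubbed here as a fixed value of 60 frames).
encoder, decoder = TextGRUEncoder(), MotionGRUDecoder()
tokens = torch.randint(0, 8000, (2, 12))             # two dummy tokenized descriptions
mu, logvar = encoder(tokens)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
motion = decoder(z, length=60)                        # (2, 60, 263)
```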
The motion length network is trained for 200 epochs with the Adam optimizer, a learning rate of 0.000 1, and a batch size of 64.

Result Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with three other state-of-the-art methods, and relative to the best available results, the proposed method improves the Top1, Top2, and Top3 values of R-precision by 0.031, 0.034, and 0.028, respectively, the Fréchet inception distance (FID) by 0.094, and diversity by 0.065. The qualitative evaluation was divided into three parts: body shape feature generation, action feature generation, and dynamic 3D human model generation with body shape features. The body shape generation part was tested using different text descriptions (e.g., tall, short, fat, thin). For the action generation part, the same text descriptions were used to compare the proposed method with other methods. By combining the body shape and action features of the human body, the generation of dynamic 3D human models with body shape features is demonstrated. In addition, ablation experiments, including comparisons of different loss functions, were performed to further demonstrate the effectiveness of the method. The final experimental results show that the proposed method improves the quality of the generated models.

Conclusion This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body shape and action information. The body shape generation module can generate SMPL parameterized human models whose body shape conforms to the textual description, while the action generation module can generate variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, and the generated human models have diverse body shapes and motions. On the HumanML3D dataset, the method outperforms other similar state-of-the-art algorithms.
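The Result paragraph reports R-precision (Top1/Top2/Top3), FID, and diversity. As a reference for how Top-k R-precision is commonly computed on the HumanML3D benchmark, the sketch below ranks each generated motion embedding against its paired text embedding plus a pool of mismatched texts; the pool size of 32 and Euclidean ranking follow the usual protocol, while the motion/text feature extractors are assumed to be given and are not shown. This is a generic illustration, not the paper's evaluation code.

```python
# Hedged sketch of Top-k R-precision as commonly used on HumanML3D-style benchmarks.
import torch

def r_precision(motion_feats: torch.Tensor, text_feats: torch.Tensor,
                top_k: int = 3, pool_size: int = 32) -> float:
    """motion_feats, text_feats: (N, D) paired embeddings; returns Top-k R-precision."""
    n = motion_feats.size(0)
    hits = 0
    for i in range(n):
        # Build a pool: the matching text plus (pool_size - 1) randomly drawn mismatches.
        others = torch.randperm(n)
        others = others[others != i][: pool_size - 1]
        pool = torch.cat([text_feats[i:i + 1], text_feats[others]], dim=0)  # (pool, D)
        dists = torch.cdist(motion_feats[i:i + 1], pool).squeeze(0)         # (pool,)
        rank = torch.argsort(dists)                                         # ascending
        if (rank[:top_k] == 0).any():       # index 0 is the ground-truth text
            hits += 1
    return hits / n

# Example with random embeddings (shape-checking only; not real evaluation features):
m, t = torch.randn(256, 512), torch.randn(256, 512)
print(r_precision(m, t, top_k=1), r_precision(m, t, top_k=3))
```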
Authors
Li Jian
Yang Jun
Wang Liyan
Wang Yonggui
Li Jian; Yang Jun; Wang Liyan; Wang Yonggui (School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China; School of Art and Sciences, Shaanxi University of Science and Technology, Xi'an 710021, China)
Source
《中国图象图形学报》
CSCD
PKU Core Journal
2024, No. 5, pp. 1434-1446 (13 pages)
Journal of Image and Graphics
Funding
2021 Educational Informatization Teaching Reform Project of Shaanxi University of Science and Technology (JXJG2021-09).