Abstract
Objective To address the problems of existing dynamic three-dimensional (3D) digital human model generation, such as fixed body shape and single, unchangeable motion, this paper proposes a method that fuses a variational auto-encoder (VAE) network, a contrastive language-image pretraining (CLIP) network, and a gate recurrent unit (GRU) network to generate 3D human models in motion. The method can generate a 3D human model whose body shape and actions match a textual description.

Method First, a VAE encoder network produces latent codes, and the CLIP network is used in a zero-shot manner to generate a human model whose body shape matches the text, which avoids implausible skinned multi-person linear (SMPL) parameters that would otherwise yield human models with abnormal body shapes. Second, the VAE and GRU networks generate variable-length 3D human pose sequences consistent with the text, which addresses the limitation of existing motion generation methods that can only produce pose sequences of a pre-specified length. Finally, the body shape features and motion features are combined to obtain a 3D human model in motion.

Result Human model generation experiments were conducted on the HumanML3D dataset and compared with three other methods. Relative to the best existing method, R-precision improves by 0.031, 0.034, and 0.028 at Top1, Top2, and Top3, respectively; the Fréchet inception distance (FID) improves by 0.094; and diversity improves by 0.065. Ablation experiments verify the effectiveness of the model and show that the proposed method improves human model generation.

Conclusion The proposed method can generate dynamic 3D human models from textual descriptions, and the body shape and motion of the generated models better match the input text.
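The zero-shot body-shape step described above (VAE latent codes decoded to SMPL shape parameters, with CLIP ranking candidates against the text) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `decode_shape` and `render_front_view` are hypothetical stand-ins for the paper's VAE shape decoder and SMPL renderer, and the candidate count and latent size are assumed values.

```python
# Minimal sketch (not the authors' implementation) of zero-shot body-shape selection:
# sample VAE latents, decode them to candidate SMPL shape parameters, render each
# candidate, and let CLIP rank the renders against the text description.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def select_body_shape(text, decode_shape, render_front_view,
                      n_candidates=64, latent_dim=32):
    """Return the SMPL shape whose rendering best matches `text` under CLIP.

    `decode_shape(z)` maps a latent to SMPL shape parameters (assumed given);
    `render_front_view(beta)` returns a PIL image of the posed body (assumed given).
    """
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize([text]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        best_score, best_beta = -1.0, None
        for _ in range(n_candidates):
            z = torch.randn(1, latent_dim, device=device)   # sample a VAE latent code
            beta = decode_shape(z)                           # -> SMPL shape parameters
            image = preprocess(render_front_view(beta)).unsqueeze(0).to(device)
            img_feat = clip_model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            score = (img_feat @ text_feat.T).item()          # cosine similarity
            if score > best_score:
                best_score, best_beta = score, beta
    return best_beta
```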
Objective Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when it is applied to generate virtual 3D scene models from natural language. For static 3D objects, methods have emerged that can generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in many scenarios, only two-dimensional (2D) human images or sequences of human poses can currently be generated from a given textual description; dynamic 3D human models cannot yet be generated from natural language in the same way. Moreover, existing methods suffer from problems such as fixed body shape and motion when generating dynamic digital human models. A method fusing a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gate recurrent unit (GRU), which can generate dynamic 3D human models corresponding to the body shapes and motions described by the text, is proposed to address these problems.

Method A method based on the VAE network is proposed in this paper to generate dynamic 3D human models that correspond to the body shape and action information described in the text. Notably, a variety of pose sequences with variable duration can be generated with the proposed method. First, the body shape information is obtained through the body shape generation module based on the VAE network and the CLIP model, and zero-shot matching is used to generate a skinned multi-person linear (SMPL) parametric human model that matches the textual description. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches textual descriptions against body shapes, and the 3D human model with the highest matching score is selected. Second, variable-length 3D human pose sequences that match the textual description are generated through the body action generation module based on the VAE and GRU networks. In particular, the VAE auto-encoder encodes the dynamic human poses, the action length sampling network predicts a motion duration that matches the textual description of the action, and the GRU and VAE networks encode the input text and generate diverse dynamic 3D human pose sequences through the decoder. Finally, a dynamic 3D human model corresponding to the body shape and action description is generated by fusing the body shape and action information obtained above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14 616 motions and 44 970 linguistic annotations. Some of the motions in the dataset are mirrored before training, and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment with a Tesla V100 GPU and 16 GB of GPU memory. The motion auto-encoder is trained for 300 epochs with the adaptive moment estimation (Adam) optimizer, a learning rate of 0.000 1, and a batch size of 128. The motion generator is trained for 320 epochs with the Adam optimizer, a learning rate of 0.000 2, and a batch size of 32.
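The action generation module described above (a GRU text encoder, an action-length sampling network, and a VAE decoder producing variable-length pose sequences) can be illustrated with a short PyTorch-style sketch. All module names, hidden sizes, and the 263-dimensional pose representation (the usual HumanML3D feature size) are assumptions for illustration and do not reproduce the paper's exact architecture or losses.

```python
# Hedged sketch of a text-conditioned, variable-length motion generator in the spirit of
# the VAE + GRU action module described above; sizes and names are illustrative only.
import torch
import torch.nn as nn

class TextGRUEncoder(nn.Module):
    """Encode a tokenized text description into the mean/log-variance of a motion latent."""
    def __init__(self, vocab_size=8000, embed_dim=300, hidden_dim=512, latent_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        _, h = self.gru(self.embed(tokens))          # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

class MotionGRUDecoder(nn.Module):
    """Decode a latent code into a pose sequence of a given (sampled) length."""
    def __init__(self, latent_dim=256, hidden_dim=512, pose_dim=263):
        super().__init__()
        self.init_h = nn.Linear(latent_dim, hidden_dim)
        self.gru_cell = nn.GRUCell(pose_dim, hidden_dim)
        self.to_pose = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, length):
        h = self.init_h(z)
        pose = torch.zeros(z.size(0), self.to_pose.out_features, device=z.device)
        poses = []
        for _ in range(length):                      # autoregressive roll-out
            h = self.gru_cell(pose, h)
            pose = self.to_pose(h)
            poses.append(pose)
        return torch.stack(poses, dim=1)             # (batch, length, pose_dim)

# Usage: reparameterize the text latent and decode a sequence whose length would come
# from the action-length sampling network (stubbed here as a fixed value of 60 frames).
encoder, decoder = TextGRUEncoder(), MotionGRUDecoder()
tokens = torch.randint(0, 8000, (2, 12))             # two dummy tokenized descriptions
mu, logvar = encoder(tokens)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
motion = decoder(z, length=60)                        # (2, 60, 263)
```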
The motion length network is trained for 200 epochs with the Adam optimizer, a learning rate of 0.000 1, and a batch size of 64.

Result Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with three other state-of-the-art methods, and relative to the best available results, the proposed method improves the Top1, Top2, and Top3 values of R-precision by 0.031, 0.034, and 0.028, respectively, the Fréchet inception distance (FID) by 0.094, and diversity by 0.065. The qualitative evaluation was divided into three parts: body shape feature generation, action feature generation, and dynamic 3D human model generation with body shape features. The body shape generation part was tested using different text descriptions (e.g., tall, short, fat, thin). For the action generation part, the same text descriptions were used to compare the proposed method with other methods. By combining the body shape and action features of the human body, the generation of dynamic 3D human models with body shape features is demonstrated. In addition, ablation experiments, including comparisons of different loss functions, were performed to further demonstrate the effectiveness of the method. The final experimental results show that the proposed method improves the quality of the generated models.

Conclusion This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body shape and action information. The body shape generation module can generate SMPL parameterized human models whose body shape conforms to the textual description, while the action generation module can generate variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, and the generated human models have diverse body shapes and motions. On the HumanML3D dataset, the method outperforms other similar state-of-the-art algorithms.
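The Result paragraph reports R-precision (Top1/Top2/Top3), FID, and diversity. As a reference for how Top-k R-precision is commonly computed on the HumanML3D benchmark, the sketch below ranks each generated motion embedding against its paired text embedding plus a pool of mismatched texts; the pool size of 32 and Euclidean ranking follow the usual protocol, while the motion/text feature extractors are assumed to be given and are not shown. This is a generic illustration, not the paper's evaluation code.

```python
# Hedged sketch of Top-k R-precision as commonly used on HumanML3D-style benchmarks.
import torch

def r_precision(motion_feats: torch.Tensor, text_feats: torch.Tensor,
                top_k: int = 3, pool_size: int = 32) -> float:
    """motion_feats, text_feats: (N, D) paired embeddings; returns Top-k R-precision."""
    n = motion_feats.size(0)
    hits = 0
    for i in range(n):
        # Build a pool: the matching text plus (pool_size - 1) randomly drawn mismatches.
        others = torch.randperm(n)
        others = others[others != i][: pool_size - 1]
        pool = torch.cat([text_feats[i:i + 1], text_feats[others]], dim=0)  # (pool, D)
        dists = torch.cdist(motion_feats[i:i + 1], pool).squeeze(0)         # (pool,)
        rank = torch.argsort(dists)                                         # ascending
        if (rank[:top_k] == 0).any():       # index 0 is the ground-truth text
            hits += 1
    return hits / n

# Example with random embeddings (shape-checking only; not real evaluation features):
m, t = torch.randn(256, 512), torch.randn(256, 512)
print(r_precision(m, t, top_k=1), r_precision(m, t, top_k=3))
```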
Authors
Li Jian
Yang Jun
Wang Liyan
Wang Yonggui
Li Jian; Yang Jun; Wang Liyan; Wang Yonggui (School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China; School of Art and Sciences, Shaanxi University of Science and Technology, Xi'an 710021, China)
Source
《中国图象图形学报》
CSCD
PKU Core Journal
2024, No. 5, pp. 1434-1446 (13 pages)
Journal of Image and Graphics
Funding
2021 Educational Informatization Teaching Reform Project of Shaanxi University of Science and Technology (JXJG2021-09).