
Speech-driven virtual speaker generation method integrating multiple emotions
Abstract  Virtual speaker generation is an important research direction in artificial intelligence, aiming to generate virtual speakers with realistic voices by computer. However, existing methods often neglect emotional expression, and the facial details of the generated face images lack realism, which limits the expressiveness and interactivity of virtual speakers. To address this, this paper proposes a Transformer-based generative adversarial network (GAN) method for generating virtual speakers with different emotions (GANLTB). The method is built on the GAN architecture: the generator uses a Transformer model to process speech and image features and, conditioned on emotion information combined with a latent-space vector, produces speech and images carrying the specified emotion. The discriminator assesses the authenticity of the generated results and provides gradient signals to guide generator training. Introducing bicubic interpolation further improves the image quality of the generated virtual speaker, making facial details clearer and expressions more natural and vivid. The method is validated on the emotionally diverse CREMA-D dataset, using subjective evaluation and objective metrics to assess the emotional expressiveness and quality of the generated speech and images. Experimental results show that the method can generate virtual speakers with diverse and realistic emotional expressions. Compared with other state-of-the-art methods, the proposed method is clearer in details such as fluency and realism, giving a better sense of reality.
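The bicubic interpolation step mentioned in the abstract can be sketched as follows. This is a minimal, self-contained illustration of the standard cubic-convolution kernel (with the common choice a = -0.5) applied in one dimension, not the authors' implementation; the function names are invented for the example, and full bicubic image upscaling applies the same 1-D pass along rows and then columns.

```python
import math

def cubic_kernel(x, a=-0.5):
    """Cubic convolution kernel underlying bicubic interpolation.

    Piecewise cubic with support on [-2, 2]; equals 1 at x = 0 and 0 at
    the other integers, so it reproduces the original samples exactly.
    """
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def interp1d_cubic(samples, t):
    """Interpolate a 1-D signal at fractional position t from 4 neighbours.

    Out-of-range neighbour indices are clamped to the signal borders.
    """
    i = math.floor(t)
    value = 0.0
    for k in range(i - 1, i + 3):
        kk = min(max(k, 0), len(samples) - 1)  # clamp at the borders
        value += samples[kk] * cubic_kernel(t - k)
    return value

def upscale2x(samples):
    """Upscale a 1-D signal by 2x with cubic interpolation.

    Applying this along each row and then each column of an image
    gives 2x bicubic image upscaling.
    """
    return [interp1d_cubic(samples, j / 2.0) for j in range(2 * len(samples) - 1)]
```

Because the kernel is 1 at zero and 0 at the other integer offsets, original pixel values pass through unchanged, and only the new in-between samples are synthesized from the four nearest neighbours, which is what makes bicubic output sharper than bilinear.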
Authors  Li Shuaishuai; He Xiangzhen; Zhang Yuezhou; Wang Jiaxin (Key Laboratory of Linguistic & Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou 730030, China; Key Laboratory of Ethnic Language & Cultural Intelligent Information Processing, Northwest Minzu University, Lanzhou 730030, China)
Source  Application Research of Computers (《计算机应用研究》, CSCD, Peking University core journal), 2024, No. 8, pp. 2546-2553 (8 pages)
Funding  National Natural Science Foundation of China (62341209); Gansu Province education and teaching achievement cultivation project (2023GSJXCGPY-60); Fundamental Research Funds for the Central Universities (31920230054)
Keywords  virtual speaker; generative adversarial network (GAN); Transformer; multi-emotion expression; speech-driven