
Speech-driven virtual speaker generation method integrating multiple emotions
Abstract  Virtual speaker generation is an important research direction in artificial intelligence, aiming to generate virtual speakers with realistic voices by computer. However, existing methods often neglect emotional expression, and the facial details of the generated face images lack realism, which limits the expressiveness and interactivity of virtual speakers. To address this, this paper proposes a Transformer-based generative adversarial network (GAN) method for generating virtual speakers with different emotions (GANLTB). The method is built on the GAN architecture: the generator uses a Transformer model to process speech and image features and, conditioned on emotion information combined with a latent-space vector, produces speech and images carrying the specified emotion. The discriminator assesses the authenticity of the generated results and provides gradient signals to guide generator training. Introducing bicubic interpolation further improves the image quality of the generated virtual speaker, making facial details clearer and expressions more natural and vivid. The method is validated on the emotionally diverse CREMA-D dataset, using subjective evaluation and objective metrics to assess the emotional expressiveness and quality of the generated speech and images. Experimental results show that the method can generate virtual speakers with diverse and realistic emotional expressions. Compared with other state-of-the-art methods, the proposed method is clearer in details such as fluency and realism, giving a better sense of reality.
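The bicubic interpolation step mentioned in the abstract can be sketched as follows. This is a minimal, self-contained illustration of the standard cubic-convolution kernel (with the common choice a = -0.5) applied in one dimension, not the authors' implementation; the function names are invented for the example, and full bicubic image upscaling applies the same 1-D pass along rows and then columns.

```python
import math

def cubic_kernel(x, a=-0.5):
    """Cubic convolution kernel underlying bicubic interpolation.

    Piecewise cubic with support on [-2, 2]; equals 1 at x = 0 and 0 at
    the other integers, so it reproduces the original samples exactly.
    """
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def interp1d_cubic(samples, t):
    """Interpolate a 1-D signal at fractional position t from 4 neighbours.

    Out-of-range neighbour indices are clamped to the signal borders.
    """
    i = math.floor(t)
    value = 0.0
    for k in range(i - 1, i + 3):
        kk = min(max(k, 0), len(samples) - 1)  # clamp at the borders
        value += samples[kk] * cubic_kernel(t - k)
    return value

def upscale2x(samples):
    """Upscale a 1-D signal by 2x with cubic interpolation.

    Applying this along each row and then each column of an image
    gives 2x bicubic image upscaling.
    """
    return [interp1d_cubic(samples, j / 2.0) for j in range(2 * len(samples) - 1)]
```

Because the kernel is 1 at zero and 0 at the other integer offsets, original pixel values pass through unchanged, and only the new in-between samples are synthesized from the four nearest neighbours, which is what makes bicubic output sharper than bilinear.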
Authors  Li Shuaishuai; He Xiangzhen; Zhang Yuezhou; Wang Jiaxin (Key Laboratory of Linguistic & Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou 730030, China; Key Laboratory of Ethnic Language & Cultural Intelligent Information Processing, Northwest Minzu University, Lanzhou 730030, China)
Source  Application Research of Computers (《计算机应用研究》, CSCD, Peking University core journal), 2024, No. 8, pp. 2546-2553 (8 pages)
Funding  National Natural Science Foundation of China (62341209); Gansu Province education and teaching achievement cultivation project (2023GSJXCGPY-60); Fundamental Research Funds for the Central Universities (31920230054)
Keywords  virtual speaker; generative adversarial network (GAN); Transformer; multi-emotion expression; speech-driven