Gradient Normalization for Adaptive Loss Balancing in End-to-End Speech Synthesis

Abstract: Text-to-speech (TTS) synthesis generates a target speaker's speech from input text by means of a model, and it is now widely used in real-world applications. Among TTS models, VITS (Variational Inference for Text-to-Speech) combines several task losses effectively and produces higher-quality, more natural-sounding speech than traditional two-stage models. Its performance, however, is highly sensitive to how those losses are balanced, and how to weigh them effectively has so far received little study. Building on the existing VITS loss functions, this study introduces gradient normalization, an adaptive loss-balancing method that adjusts the weight of each loss dynamically during training according to the magnitudes of the different losses and the training rates of the different subtasks, in order to test its suitability for speech synthesis. The accuracy and naturalness of the synthesized speech were evaluated on a publicly available Chinese TTS dataset. Models that used this method to balance their losses performed better, which supports the effectiveness of the approach. The work contributes to TTS technology, particularly in the context of the VITS model.
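The weighting rule the abstract describes (balancing by loss magnitude and by per-task training rate) can be sketched roughly as follows. This is a minimal, dependency-free illustration of the general gradient-normalization formulation, not the authors' implementation: the function name `gradnorm_step`, the hyperparameters `alpha` and `lr`, and the assumption that per-loss gradient norms with respect to shared parameters are already computed are all introduced here for illustration.

```python
def gradnorm_step(weights, grad_norms, losses, initial_losses,
                  alpha=1.5, lr=0.025):
    """One gradient-normalization-style update of the loss weights (sketch).

    weights        -- current loss weights w_i
    grad_norms     -- norm of the gradient of each unweighted loss with
                      respect to the shared parameters (assumed precomputed)
    losses         -- current loss values L_i(t)
    initial_losses -- loss values recorded at the first step, L_i(0)
    alpha, lr      -- hypothetical hyperparameters chosen for illustration
    """
    n = len(weights)
    # Weighted gradient norms G_i = w_i * ||grad L_i|| capture loss magnitude.
    weighted = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted) / n
    # Loss ratios L_i(t) / L_i(0): a larger ratio means slower training.
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    # Each task's target norm grows with its relative (inverse) training rate.
    targets = [mean_norm * (r / mean_ratio) ** alpha for r in ratios]
    new_weights = []
    for w, g, big_g, target in zip(weights, grad_norms, weighted, targets):
        # Gradient of |G_i - target_i| w.r.t. w_i, with the target held constant.
        sign = (big_g > target) - (big_g < target)
        new_weights.append(max(w - lr * sign * g, 1e-8))
    # Renormalize so the weights always sum to the number of losses.
    total = sum(new_weights)
    return [w * n / total for w in new_weights]


# Hypothetical two-loss example: the first loss dominates the gradient,
# so its weight is reduced and the second loss's weight is increased.
example = gradnorm_step(weights=[1.0, 1.0], grad_norms=[4.0, 1.0],
                        losses=[2.0, 2.0], initial_losses=[2.0, 2.0])
print(example)
```

In a real training loop the gradient norms would come from the framework's automatic differentiation, and the update would run once per training step alongside the ordinary parameter update.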

Authors: CHEN Kuan; CHEN Tao; YOU Weike; ZHOU Linna; YANG Zhongliang

Affiliations: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 102206, China; Information Security Research Center, Beijing University of Posts and Telecommunications, Beijing 102206, China; School of Cyber Science and Engineering, University of International Relations, Beijing 100091, China

Source: Journal of Cybersecurity (网络空间安全科学学报), 2024, No. 1, pp. 72-82 (11 pages)

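Funding: National Natural Science Foundation of China (62172053, 62302059); National Key Research and Development Program of China (2021YFC3340602, 2022YFC3300800, 2023YFC3305401); Fundamental Research Funds for the Central Universities (2023RC30).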

Keywords: text-to-speech; end-to-end speech synthesis; multi-task learning; multi-objective optimization; gradient normalization

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
