Abstract
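Speech synthesis is the process of generating a target speaker's speech from given text through a model, and the technology is already widely used in practice. Among the many speech synthesis models, VITS (The Variational Inference for Text-to-Speech) effectively combines multi-task loss functions and, compared with earlier models, produces speech of higher quality and more natural sound. However, existing models rely on multiple loss functions, and research on how to weigh them effectively is still lacking. Therefore, building on the loss functions of the existing model, a gradient normalization adaptive loss balancing optimization method is introduced. It balances the weights of the loss functions according to their magnitudes and the training speeds of the different subtasks, in order to verify the method's applicability to speech synthesis. The accuracy and naturalness of the synthesized speech were evaluated on a public Chinese speech synthesis dataset. The results show that the model using this loss balancing improved in performance, demonstrating the effectiveness of the method.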
Text-to-Speech (TTS) synthesis refers to the process of generating a target speaker's speech from given text through model processing. It has become a crucial component in numerous applications. The Variational Inference for Text-to-Speech (VITS) model represents a significant advancement in TTS technology, offering superior speech quality and a more natural sound compared to traditional two-stage models. However, the performance of the VITS model is highly sensitive to how its losses are balanced, and research on balancing these losses effectively is currently lacking. This study introduced Gradient Normalization for adaptive loss balancing in end-to-end speech synthesis as a means to identify the optimal balance for the VITS model. This method aimed to enhance the model's adaptability by dynamically adjusting the weighting of different loss components during training. To assess the accuracy and naturalness of speech synthesized with the proposed approach, the study conducted experiments on a publicly available Chinese TTS dataset. The results demonstrated that models using this method to balance losses achieved performance improvements, confirming the effectiveness of the approach. The significance of this research lies in its contribution to advancing TTS technology, particularly in the context of the VITS model.
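The weight balancing described in the abstract can be illustrated with a minimal sketch of one gradient-normalization update step. This is not the authors' implementation: the function name gradnorm_step, the hyperparameters alpha and lr, and the choice to pass precomputed per-loss gradient norms are all assumptions made for illustration, loosely following the commonly cited GradNorm formulation. Each weight is nudged so that its weighted gradient norm moves toward a shared target, which is scaled up for losses that are training more slowly.

```python
def gradnorm_step(weights, grad_norms, losses, initial_losses, alpha=1.5, lr=0.025):
    """One illustrative GradNorm-style update of the loss weights.

    weights        : current weight w_i of each loss term
    grad_norms     : norm of the gradient of each unweighted loss with respect
                     to the shared parameters (assumed computed elsewhere)
    losses         : current value of each loss
    initial_losses : value of each loss at the start of training
    alpha, lr      : assumed hyperparameters (asymmetry strength, step size)
    """
    n = len(weights)
    # Weighted gradient norm G_i = w_i * g_i and its mean over all losses.
    weighted = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted) / n
    # Loss ratio L_i(t) / L_i(0): a larger ratio means slower training.
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    # Target gradient norm per loss, treated as a constant during the update.
    targets = [mean_norm * (r / mean_ratio) ** alpha for r in ratios]

    def sign(x):
        return (x > 0) - (x < 0)

    # Gradient of |w_i * g_i - target_i| with respect to w_i is sign(...) * g_i.
    updated = [
        max(w - lr * sign(big_g - t) * g, 1e-8)  # keep weights positive
        for w, big_g, t, g in zip(weights, weighted, targets, grad_norms)
    ]
    # Renormalize so the weights always sum to the number of losses.
    total = sum(updated)
    return [n * u / total for u in updated]
```

In a training loop, the returned weights would multiply the individual loss terms before the next backward pass. Passing gradient norms in as plain numbers keeps the sketch independent of any particular deep learning framework.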
Authors
陈宽
陈涛
尤玮珂
周琳娜
杨忠良
CHEN Kuan; CHEN Tao; YOU Weike; ZHOU Linna; YANG Zhongliang (School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 102206, China; Information Security Research Center, Beijing University of Posts and Telecommunications, Beijing 102206, China; School of Cyber Science and Engineering, University of International Relations, Beijing 100091, China)
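Funding
National Natural Science Foundation of China (62172053, 62302059)
National Key Research and Development Program of China (2021YFC3340602, 2022YFC3300800, 2023YFC3305401)
Fundamental Research Funds for the Central Universities (2023RC30)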
Keywords
text-to-speech
end-to-end speech synthesis
multi-task learning
multi-objective optimization
gradient normalization