Abstract
In most speech synthesis systems, the quality of the predicted Mel spectrum directly determines the quality of the final synthesized speech. Mel spectra predicted by Tacotron 2-based frameworks usually lack the fine structure found in real data. To address this problem, a CBHG-based post-processing network is proposed. The network analyzes the Mel spectrum output by the decoder, predicts its missing fine structure, and superimposes that structure on the decoder output to produce a refined Mel spectrum, thereby improving the quality of the synthesized speech. Experimental results show that the proposed post-processing network effectively restores the fine structure lost during decoding. Combined with the high-performance, high-efficiency HiFi-GAN vocoder, the final synthesized speech achieves a Mean Opinion Score (MOS) of 4.10, an improvement of 0.26 over the baseline.
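The residual refinement described in the abstract (predict the missing fine structure, then add it to the decoder output) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_postnet` is a hypothetical stand-in that keeps only a convolution bank and a linear projection, omitting the highway layers and bidirectional GRU of a real CBHG module. The 80-channel Mel dimension, the kernel widths, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Convolve every Mel channel along the time axis, keeping the frame count.
    x: (frames, n_mels), kernel: (width,) with width <= frames."""
    return np.stack(
        [np.convolve(x[:, m], kernel, mode="same") for m in range(x.shape[1])],
        axis=1,
    )

def toy_postnet(mel, kernels, w_out):
    """Hypothetical, simplified stand-in for a CBHG post-net:
    convolution bank over time -> ReLU -> linear projection to n_mels."""
    bank = np.concatenate([conv1d_same(mel, k) for k in kernels], axis=1)
    hidden = np.maximum(bank, 0.0)   # ReLU
    return hidden @ w_out            # predicted fine structure (residual)

def refine(mel, kernels, w_out):
    """Refined Mel = decoder output + predicted fine structure."""
    return mel + toy_postnet(mel, kernels, w_out)

# Illustrative inputs only (random data, untrained weights).
rng = np.random.default_rng(0)
frames, n_mels = 50, 80              # 80 Mel channels is an assumption
decoder_mel = rng.standard_normal((frames, n_mels))
kernels = [np.ones(k) / k for k in (1, 2, 3, 4)]
w_out = 0.01 * rng.standard_normal((n_mels * len(kernels), n_mels))
refined = refine(decoder_mel, kernels, w_out)
```

Because the post-net output is added to its input rather than replacing it, the network only has to model the detail the decoder missed; with an all-zero projection the refined spectrum equals the decoder output.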
Authors
唐君
张连海
李嘉欣
TANG Jun; ZHANG Lianhai; LI Jiaxin (Information Engineering University, Zhengzhou 450001, China)
Source
《信息工程大学学报》
2022, No. 2, pp. 135-140 (6 pages)
Journal of Information Engineering University
Funding
Supported by the National Natural Science Foundation of China (61673395, 62171470).