Abstract
Prosodic structure prediction is a key step in a text-to-speech system, and its results directly affect the naturalness and intelligibility of the synthesized speech. This paper proposes a prosodic structure prediction method based on a pretrained language representation model. With the character as the modeling unit, a separate output layer is set for each prosodic level on top of the pretrained model, and the model is fine-tuned with prosody-labeled data. A word segmentation task is then added, and multitask learning is used to model both the relationships among the prosodic levels and the relationship between prosody and words, so that the prosodic boundaries at all levels of the input text are predicted simultaneously. The experiments first demonstrate the soundness of the multi-output structure and the effectiveness of the pretrained model, and verify that adding the word segmentation task further improves model performance. Compared with the two baselines, the best result yields absolute improvements in F1 score of 2.48% and 4.50% over the conditional random field model for prosodic word and prosodic phrase prediction, respectively, and of 6.2% and 5.4% over the bidirectional long short-term memory model. Finally, the experiments show that the method reduces the amount of training data required while maintaining prediction performance.
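The multi-output, multitask setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the task names, the binary boundary labels, the unweighted sum of losses, and the random vectors standing in for the pretrained model's per-character outputs are all assumptions made for illustration.

```python
import math
import random

# Hypothetical task names: the abstract mentions prosodic-word and
# prosodic-phrase prediction plus an auxiliary word segmentation task.
TASKS = ("prosodic_word", "prosodic_phrase", "word_seg")
NUM_CLASSES = 2  # assumption: 1 = boundary after this character, 0 = none


def make_head(hidden_size, rng):
    """Build one independent linear output layer (weights and biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
               for _ in range(NUM_CLASSES)]
    biases = [0.0] * NUM_CLASSES
    return weights, biases


def head_logits(head, char_vec):
    """Apply a head to the shared representation of one character."""
    weights, biases = head
    return [sum(w * x for w, x in zip(row, char_vec)) + b
            for row, b in zip(weights, biases)]


def cross_entropy(logits, label):
    """Negative log-softmax probability of the gold label."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def multitask_loss(heads, char_vecs, labels):
    """Unweighted sum over tasks of the mean per-character cross-entropy."""
    total = 0.0
    for task in TASKS:
        per_char = [cross_entropy(head_logits(heads[task], vec), gold)
                    for vec, gold in zip(char_vecs, labels[task])]
        total += sum(per_char) / len(per_char)
    return total


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def predict(heads, char_vecs):
    """Predict all tasks from the same shared representations."""
    return {task: [argmax(head_logits(heads[task], vec)) for vec in char_vecs]
            for task in TASKS}


rng = random.Random(0)
hidden_size = 8
# Stand-in for the pretrained encoder output: one vector per character.
char_vecs = [[rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
             for _ in range(5)]
heads = {task: make_head(hidden_size, rng) for task in TASKS}
labels = {  # made-up gold labels for a five-character input
    "prosodic_word": [0, 1, 0, 0, 1],
    "prosodic_phrase": [0, 0, 0, 0, 1],
    "word_seg": [0, 1, 0, 1, 1],
}
loss = multitask_loss(heads, char_vecs, labels)
predictions = predict(heads, char_vecs)
```

In a real system the joint loss would be backpropagated through both the heads and the pretrained model during fine-tuning. Sharing one encoder across all heads is what lets the auxiliary segmentation labels influence the prosody predictions.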
Authors
Zhang Pengyuan (张鹏远)
Lu Chunhui (卢春晖)
Wang Ruimin (王睿敏)
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
Source
Journal of Tianjin University: Science and Technology (《天津大学学报(自然科学与工程技术版)》)
EI
CSCD
Peking University Core Journals (北大核心)
2020, No. 3, pp. 265-271 (7 pages)
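Funding
National Natural Science Foundation of China (11590773, 11590770)
Armed Forces Shared Information System Equipment Pre-research Project (全军共用信息系统装备预研项目) (JZX2017-0994/Y306)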
Keywords
prosodic structure prediction
pretrained language representation model
multitask learning
speech synthesis