Abstract
This paper proposes a trainable unit selection speech synthesis method based on a statistical modeling framework. In the training stage, acoustic features are extracted from the training database and a statistical model is estimated for each feature. During synthesis, the optimal candidate unit sequence is searched for in the database under a maximum likelihood criterion derived from the trained models. Finally, the waveforms of the optimal candidate units are concatenated to produce synthetic speech. Experimental results show that, compared with the conventional unit selection synthesis method, this method effectively improves both the automation of system construction and the naturalness of synthetic speech. Furthermore, based on the characteristics of unit selection speech synthesis, this paper presents a minimum unit selection error criterion for model training and adopts discriminative training for model parameter estimation. This criterion makes system construction fully automatic and further improves the naturalness of synthetic speech.
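The synthesis-time search described in the abstract can be illustrated with a minimal sketch. It assumes one-dimensional acoustic features, a Gaussian target model per position, and a zero-mean Gaussian over feature differences at each join; these choices, and the names `gaussian_loglik`, `select_units`, `target_models`, and `join_var`, are hypothetical and are not taken from the paper, whose actual features and model forms are not given here. The sketch only shows how a maximum likelihood criterion can drive a dynamic-programming search over candidate units.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a one-dimensional Gaussian (illustrative model choice)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def select_units(candidates, target_models, join_var):
    """Viterbi search for the candidate sequence with the highest log-likelihood.

    candidates[t]    : list of (unit_id, feature) pairs for target position t
    target_models[t] : (mean, var) of the assumed Gaussian at position t
    join_var         : variance of the assumed zero-mean Gaussian over the
                       feature difference between consecutive units
    """
    # best[t][j]: best total score of any path ending in candidate j at position t
    best = [[gaussian_loglik(f, *target_models[0]) for _, f in candidates[0]]]
    back = []  # back[t-1][j]: index of the best predecessor of candidate j at t

    for t in range(1, len(candidates)):
        scores, pointers = [], []
        for _, feat in candidates[t]:
            target = gaussian_loglik(feat, *target_models[t])
            # Score of reaching this candidate from each previous candidate.
            options = [
                best[t - 1][i] + gaussian_loglik(feat - prev_feat, 0.0, join_var)
                for i, (_, prev_feat) in enumerate(candidates[t - 1])
            ]
            i_best = max(range(len(options)), key=options.__getitem__)
            scores.append(options[i_best] + target)
            pointers.append(i_best)
        best.append(scores)
        back.append(pointers)

    # Backtrack from the best final candidate.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j][0] for t, j in enumerate(path)]
```

In this toy setting the returned unit identifiers would then index the waveform segments to be concatenated. A real system would use multi-dimensional features and models estimated from the training database rather than hand-set means and variances.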
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 60475015 and 60610298) and the National Hi-Tech Research and Development Program of China (Grant Nos. 2006AA01Z137 and 2006AA010104).