This paper proposes a trainable unit selection speech synthesis method based on a statistical modeling framework. In the training stage, acoustic features are extracted from the training database and statistical models are estimated for each feature. During synthesis, the optimal candidate unit sequence is searched for in the database under the maximum likelihood criterion derived from the trained models. Finally, the waveforms of the optimal candidate units are concatenated to produce synthetic speech. Experimental results show that, compared with the conventional unit selection synthesis method, this method improves both the automation of system construction and the naturalness of the synthetic speech. Furthermore, the paper presents a minimum unit selection error criterion for model training, motivated by the characteristics of unit selection speech synthesis, and adopts discriminative training for model parameter estimation. This criterion makes system construction fully automatic and further improves the naturalness of the synthetic speech.
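The search step described in this abstract can be illustrated with a small dynamic-programming sketch. This is not the paper's implementation: it assumes one scalar feature per unit with a single Gaussian target model per position, and a zero-mean Gaussian over the mismatch at each join. The names `gaussian_loglik`, `select_units`, `target_models`, and `join_var` are hypothetical, and waveform concatenation is left out.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a scalar Gaussian (an assumed stand-in for the trained models)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def select_units(candidates, target_models, join_var):
    """Viterbi search for the candidate sequence with maximum total log-likelihood.

    candidates[t]    -- list of (feature, left_edge, right_edge) tuples for position t
    target_models[t] -- (mean, var) of the assumed target model for position t
    join_var         -- variance of the assumed zero-mean model of join mismatch
    Returns the index of the chosen candidate at every position.
    """
    mean, var = target_models[0]
    # best[t][j]: best log-likelihood of any path ending in candidate j at position t
    best = [[gaussian_loglik(c[0], mean, var) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for t in range(1, len(candidates)):
        mean, var = target_models[t]
        row, ptr = [], []
        for cand in candidates[t]:
            # Score every predecessor: its path score plus the join log-likelihood.
            scores = [
                best[t - 1][i] + gaussian_loglik(cand[1] - prev[2], 0.0, join_var)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[i_best] + gaussian_loglik(cand[0], mean, var))
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the best final candidate.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(len(candidates) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path
```

Because the join term is scored together with the target term, a candidate that matches its target model slightly worse can still win if it joins its neighbour much more smoothly.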
The term "Experimental" in the title means that the synthesizer is built as a tool for conducting experiments that investigate how the environment of a unit influences its sound. As a tool for testing hypotheses and experimental results, the synthesizer satisfies three conditions: independence from the choice of unit for synthesis (a word or any part of it); consideration of the unit's environment (left and right contexts and the position of the unit); and independence from the content of the base. Such a synthesizer is a good tool for studying many aspects of speech and removes the problem of selection. With the same synthesizer, we can vary the unit and the other parameters described in the paper, synthesize the same text, and listen to the results directly. This paper describes the formal structure of the experimental Georgian speech synthesizer.
The use of non-uniform processes greatly helps corpus-based text-to-speech (TTS) systems synthesize natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in the loss of non-uniform synthesis instances. To solve this problem, we propose the concept of virtual non-uniform instances. Based on this concept and the synthesis frequency of each instance, an algorithm named StaRp-VPA is constructed to compensate for the loss of non-uniform instances. In experimental testing, naturalness as measured by the mean opinion score (MOS) remains almost unchanged when fewer than 50% of the instances are pruned, and the MOS degrades only slightly for reduction rates above 50%. The test results show that the StaRp-VPA algorithm is effective.
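As a rough illustration of pruning by synthesis frequency only, the sketch below keeps the instances that were selected most often in a log of past synthesis runs. It is not StaRp-VPA and does not model virtual non-uniform instances, whose details the abstract does not give. The function `prune_by_usage` and its parameters `selection_log` and `keep_fraction` are hypothetical names.

```python
import math
from collections import Counter


def prune_by_usage(instance_ids, selection_log, keep_fraction):
    """Keep the most frequently selected instances (assumed pruning rule).

    instance_ids  -- all instance identifiers in the voice font
    selection_log -- identifiers of instances chosen during past synthesis runs
    keep_fraction -- fraction of instances to retain, between 0 and 1
    """
    usage = Counter(selection_log)
    # Most-used first; ties are broken by identifier so the result is deterministic.
    ranked = sorted(instance_ids, key=lambda i: (-usage[i], i))
    n_keep = math.ceil(len(ranked) * keep_fraction)
    return set(ranked[:n_keep])
```

A reduction rate of 50%, the threshold reported in the abstract, corresponds to `keep_fraction=0.5` in this sketch.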
Funding (trainable unit selection synthesis paper): Supported by the National Natural Science Foundation of China (Grant Nos. 60475015, 60610298) and the National Hi-Tech Research and Development Program of China (Grant Nos. 2006AA01Z137 and 2006AA010104).
Funding (StaRp-VPA paper): Supported by the National Natural Science Foundation of China (No. 60602017).