Abstract: This paper presents a method of hidden Markov model (HMM)-based Mandarin-Tibetan bilingual emotional speech synthesis by speaker adaptive training with a Mandarin emotional speech corpus. A one-speaker Tibetan neutral speech corpus, a multi-speaker Mandarin neutral speech corpus and a multi-speaker Mandarin emotional speech corpus are first used to train a set of mixed-language average acoustic models of the target emotion by speaker adaptive training. Then a one-speaker Mandarin neutral speech corpus or a one-speaker Tibetan neutral speech corpus is used to obtain a set of speaker-dependent acoustic models of the target emotion by the speaker adaptation transformation. The Mandarin or Tibetan emotional speech is finally synthesized from the Mandarin or Tibetan speaker-dependent acoustic models of the target emotion, respectively. Subjective tests show that the average emotional mean opinion score is 4.14 for Tibetan and 4.26 for Mandarin. The average mean opinion score is 4.16 for Tibetan and 4.28 for Mandarin. The average degradation opinion score is 4.28 for Tibetan and 4.24 for Mandarin. Therefore, the proposed method can synthesize both Tibetan and Mandarin speech with high naturalness and emotional expression using only a Mandarin emotional training speech corpus.
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2008AA01Z118)
Abstract: A stronger canonical model was developed to improve the performance of automatic pronunciation evaluation. Three strategies were investigated: speaker adaptive training to normalize variations among speakers, minimum phone error training to identify easily confused phones, and maximum likelihood linear regression (MLLR) adaptation to compensate for accent variations between native and non-native speakers. Combining the three schemes improved the correlation coefficient between machine scores and human scores from 0.651 to 0.679 at the sentence level and from 0.788 to 0.822 at the speaker level.
Funding: National High Technology Research & Development Programme (863) of P. R. China (2001AA4180); Zhejiang Provincial Natural Science Foundation for Young Scientist of P. R. China (RC01058)
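The pronunciation-evaluation abstract reports correlations between machine and human scores at two granularities. The following is a minimal sketch of how such figures could be computed, assuming Pearson correlation (the abstract does not name the coefficient) and assuming speaker-level scores are per-speaker means of sentence scores. The data and the helper names `pearson` and `speaker_level` are made up for illustration and do not come from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def speaker_level(records):
    """Average each speaker's sentence scores; return the two lists of means."""
    by_speaker = defaultdict(list)
    for speaker, machine, human in records:
        by_speaker[speaker].append((machine, human))
    machine_means, human_means = [], []
    for pairs in by_speaker.values():
        machine_means.append(sum(m for m, _ in pairs) / len(pairs))
        human_means.append(sum(h for _, h in pairs) / len(pairs))
    return machine_means, human_means


# Hypothetical (speaker, machine score, human score) triples, illustration only.
records = [
    ("s1", 2.0, 3.0), ("s1", 3.0, 2.0),
    ("s2", 3.0, 4.0), ("s2", 4.0, 3.0),
    ("s3", 4.0, 5.0), ("s3", 5.0, 4.0),
]

# Sentence level: correlate every sentence's machine score with its human score.
sentence_r = pearson([m for _, m, _ in records], [h for _, _, h in records])
# Speaker level: correlate per-speaker means instead.
speaker_r = pearson(*speaker_level(records))
print(f"sentence-level r = {sentence_r:.3f}, speaker-level r = {speaker_r:.3f}")
```

In this toy data, averaging over a speaker's sentences cancels per-sentence disagreement, so the speaker-level correlation comes out higher than the sentence-level one. That is one plausible reason speaker-level figures tend to exceed sentence-level figures, though the paper's actual scoring procedure may differ.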