Abstract: This paper proposes a technique for synthesizing a pixel-based photo-realistic talking face animation using two-step synthesis with HMMs and DNNs. We introduce facial expression parameters as an intermediate representation that corresponds well with both the input contexts and the output pixel data of the face images. The sequences of facial expression parameters are modeled using context-dependent HMMs with static and dynamic features. The mapping from the expression parameters to the target pixel images is trained using DNNs. We examine the amount of training data required for the HMMs and DNNs, and compare the performance of the proposed technique with a conventional PCA-based technique through objective and subjective evaluation experiments.
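To make the two-step pipeline concrete, the following is a minimal sketch of the second step only: a DNN that maps HMM-generated facial expression parameters to pixel images. The data shapes, network size, and the use of scikit-learn's MLPRegressor are illustrative assumptions, not the authors' implementation; all data are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: N frames of facial expression parameters (dim 20)
# paired with vectorized face images (32x32 grayscale -> 1024 pixels).
N, PARAM_DIM, IMG_SIDE = 1000, 20, 32
expr_params = rng.normal(size=(N, PARAM_DIM))
pixel_images = rng.uniform(size=(N, IMG_SIDE * IMG_SIDE))

# Step 2 of the pipeline: a DNN regressing pixel values from expression parameters.
dnn = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=30)
dnn.fit(expr_params, pixel_images)

# At synthesis time, an HMM-generated parameter trajectory would be pushed
# through the DNN frame by frame to obtain the animation frames.
trajectory = rng.normal(size=(75, PARAM_DIM))              # e.g. 3 s at 25 fps
frames = dnn.predict(trajectory).reshape(-1, IMG_SIDE, IMG_SIDE)
print(frames.shape)                                        # (75, 32, 32)
```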
Funding: Supported by the Science and Technology Committee of Shanghai (01JC14033).
Abstract: A fuzzy clustering analysis based phonetic tied-mixture HMM (FPTM) is presented to decrease parameter size and improve the robustness of parameter training. The FPTM is synthesized from state-tied HMMs by a modified fuzzy C-means clustering algorithm. Each Gaussian codebook of the FPTM is built from the Gaussian components within the same root node of the phonetic decision tree. Experimental results on large-vocabulary Mandarin speech recognition show that, compared with a conventional phonetic tied-mixture HMM and a state-tied HMM with approximately the same number of Gaussian mixtures, the FPTM achieves word error rate reductions of 4.84% and 13.02%, respectively. By combining the two schemes of mixture-weight pruning and fuzzy merging of Gaussian centers, a significant reduction in parameter size is achieved with little impact on recognition accuracy.
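The codebook construction rests on fuzzy C-means clustering of Gaussian components within each decision-tree root node. Below is a plain-numpy sketch of standard fuzzy C-means over a set of Gaussian mean vectors; the paper's specific modification of the algorithm and its codebook sizes are not reproduced, and all data are synthetic placeholders.

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, iters=50, eps=1e-9, seed=0):
    """Plain fuzzy C-means: returns (codebook centers, membership matrix U[N, C])."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(C), size=X.shape[0])        # random fuzzy memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / (W.sum(axis=0)[:, None] + eps)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        # u_ic = 1 / sum_k (d_ic / d_ik)^(2/(m-1)); d2 already holds squared distances
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
    return centers, U

# Hypothetical example: 300 Gaussian means (39-dim features) from one phonetic
# decision-tree root node, tied into a 16-component shared codebook.
means = np.random.default_rng(1).normal(size=(300, 39))
codebook, memberships = fuzzy_c_means(means, C=16)
print(codebook.shape, memberships.sum(axis=1)[:3])        # (16, 39), rows sum to 1
```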
Abstract: In this paper the authors examine the three fundamental problems of hidden Markov models (HMMs): evaluation, decoding, and learning. The authors explore approaches to increase the effectiveness of HMMs in the field of speech recognition. Although hidden Markov modeling has significantly improved the performance of current speech recognition systems, the general problem of completely fluent, speaker-independent speech recognition is still far from solved. For example, no system is capable of reliably recognizing unconstrained conversational speech, and there is no good way to statistically infer language structure from a limited corpus of spoken sentences. The authors therefore provide an overview of HMM theory, discuss the role of statistical methods, and point out a range of theoretical and practical issues that deserve attention and must be understood in order to further advance research in speech recognition.
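As a concrete reminder of two of the three classical problems named above, the toy example below evaluates a discrete-observation HMM with the forward algorithm and decodes it with Viterbi. The model parameters and observation sequence are arbitrary illustrations, and the learning problem (Baum-Welch re-estimation) is omitted for brevity.

```python
import numpy as np

A = np.array([[0.7, 0.3],        # state-transition probabilities (2 states)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # emission probabilities (2 states x 3 symbols)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution
obs = [0, 1, 2, 1]               # observed symbol sequence

# Evaluation problem: P(obs | model) via the forward algorithm.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
print("P(O | lambda) =", alpha.sum())

# Decoding problem: most likely state sequence via Viterbi.
delta = pi * B[:, obs[0]]
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] * A                # score of every (prev, next) pair
    backpointers.append(scores.argmax(axis=0)) # best predecessor for each state
    delta = scores.max(axis=0) * B[:, o]
path = [int(delta.argmax())]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
print("Best state path:", path[::-1])
```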
Abstract: To enhance communication between humans and robots at home in the future, speech synthesis interfaces that can generate expressive speech are indispensable. In addition, synthesizing celebrity voices is commercially important. To address these issues, this paper proposes techniques for synthesizing natural-sounding speech with a rich prosodic personality from a limited amount of data in a text-to-speech (TTS) system. As the target speaker, we chose a well-known prime minister of Japan, Shinzo Abe, whose speeches exhibit a strong prosodic personality. To synthesize natural-sounding and prosodically rich speech, accurate phrasing, robust duration prediction, and rich intonation modeling are important. For these purposes, we propose pause position prediction based on conditional random fields (CRFs), phone-duration prediction using random forests, and mora-based emphasis context labeling. We examine the effectiveness of these techniques through objective and subjective evaluations.
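As an illustration of one proposed component, the sketch below trains a random-forest phone-duration predictor with scikit-learn. The feature set (phone identity, phrase position, mora count, emphasis flag) and the synthetic durations are assumptions standing in for the paper's full context features; the CRF pause predictor and the mora-based emphasis labeling are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.integers(0, 50, n),      # phone identity (index into a phone set)
    rng.integers(0, 20, n),      # position of the phone within the phrase
    rng.integers(1, 10, n),      # number of morae in the accent phrase
    rng.integers(0, 2, n),       # emphasis flag from the context labels
])
durations = rng.gamma(shape=2.0, scale=40.0, size=n)   # synthetic durations in ms

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)
model.fit(X, durations)
print("Predicted duration (ms):", model.predict([[12, 3, 4, 1]])[0])
```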
Abstract: This paper presents a method of hidden Markov model (HMM) based Mandarin-Tibetan bilingual emotional speech synthesis by speaker adaptive training with a Mandarin emotional speech corpus. A one-speaker Tibetan neutral speech corpus, a multi-speaker Mandarin neutral speech corpus, and a multi-speaker Mandarin emotional speech corpus are first employed to train a set of mixed-language average acoustic models of the target emotion using speaker adaptive training. A one-speaker Mandarin neutral speech corpus or a one-speaker Tibetan neutral speech corpus is then adopted to obtain a set of speaker-dependent acoustic models of the target emotion using the speaker adaptation transformation. Mandarin or Tibetan emotional speech is finally synthesized from the corresponding Mandarin or Tibetan speaker-dependent acoustic models of the target emotion. Subjective tests show that the average emotional mean opinion score is 4.14 for Tibetan and 4.26 for Mandarin, the average mean opinion score is 4.16 for Tibetan and 4.28 for Mandarin, and the average degradation opinion score is 4.28 for Tibetan and 4.24 for Mandarin. The proposed method can therefore synthesize both Tibetan and Mandarin speech with high naturalness and emotional expressiveness using only a Mandarin emotional training speech corpus.
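The core of such an approach is mapping average-voice acoustic models toward a target speaker with an adaptation transform. The sketch below illustrates the idea with a single affine (MLLR-style) transform of Gaussian mean vectors estimated by least squares; this is a simplification of HTS-style speaker adaptive training, and all statistics are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 25                                        # spectral feature dimension

avg_means = rng.normal(size=(200, DIM))         # means of the average-voice models
true_W = np.eye(DIM) + 0.1 * rng.normal(size=(DIM, DIM))
true_b = 0.5 * rng.normal(size=DIM)
target_means = avg_means @ true_W.T + true_b    # "observed" target-speaker statistics

# Estimate the affine transform [W, b] by least squares on
# (average-voice mean -> target-speaker mean) pairs.
X = np.hstack([avg_means, np.ones((200, 1))])
Wb, *_ = np.linalg.lstsq(X, target_means, rcond=None)
W_hat, b_hat = Wb[:DIM].T, Wb[DIM]

adapted = avg_means @ W_hat.T + b_hat           # speaker-dependent means
print("Mean abs. adaptation error:", np.abs(adapted - target_means).mean())
```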
Funding: The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 61263036 and 61262055), the Gansu Science Fund for Distinguished Young Scholars (Grant No. 1210RJDA007), and the Natural Science Foundation of Gansu (Grant No. 1506RJYA126).
Abstract: This paper realizes a sign language-to-speech conversion system to address the communication problem between healthy people and people with speech disorders. Thirty different static sign language gestures are first recognized by combining a support vector machine (SVM) with restricted Boltzmann machine (RBM) based regulation and feedback fine-tuning of the deep model. The sign language text is then obtained from the recognition results, and a context-dependent label is generated from the recognized text by a text analyzer. Meanwhile, a hidden Markov model (HMM) based Mandarin-Tibetan bilingual speech synthesis system is developed using speaker adaptive training. Mandarin or Tibetan speech is then synthesized naturally using the context-dependent labels generated from the recognized sign language. Tests show that the static sign language recognition rate of the designed system reaches 93.6%. Subjective evaluation demonstrates that the synthesized speech achieves a mean opinion score (MOS) of 4.0.
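A hedged sketch of the recognition front end is given below: an RBM feature extractor feeding an SVM classifier over 30 static sign classes, built with scikit-learn's BernoulliRBM and SVC. The image size, hyper-parameters, and random data are placeholders, and the paper's feedback fine-tuning of the deep model is not reproduced.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_pixels, n_classes = 900, 28 * 28, 30
X = rng.uniform(size=(n_samples, n_pixels))     # hand images scaled to [0, 1]
y = rng.integers(0, n_classes, size=n_samples)  # one of 30 static signs

# RBM hidden-unit activations act as learned features for the SVM classifier.
clf = Pipeline([
    ("rbm", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10)),
    ("svm", SVC(kernel="rbf", C=10.0)),
])
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```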
Abstract: Speech recognition systems have become a distinctive family of human-computer interaction (HCI). Speech is one of the most naturally developed human abilities, and speech signal processing opens up a transparent, hands-free computing experience. This paper presents a retrospective yet modern view of speech recognition systems. The development of automatic speech recognition (ASR) has seen quite a few milestones and breakthrough technologies, which are highlighted in this paper. A step-by-step rundown of the fundamental stages in developing speech recognition systems is presented, along with a brief discussion of various modern developments and applications in this domain. This review aims to summarize the field and provide a starting point for those entering the vast area of speech signal processing. Since speech recognition has great potential in industries such as telecommunications, emotion recognition, and healthcare, this review should be helpful to researchers who aim to explore further applications that society can readily adopt in the coming years.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 90412010 and 60573166, the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 2007108, and the HP University Collaborative Foundation of China under Grant No. HLCFY08-001.