Funding: the Natural Science Foundation of Jiangsu Province (No. BK2006001).
Abstract: In this paper, two speech enhancement systems with super-Gaussian speech modeling are presented. The clean speech components are estimated by a minimum mean-square error (MMSE) estimator under the assumption that the DCT coefficients of clean speech follow a Laplacian or a Gamma distribution and the DCT coefficients of the noise are Gaussian distributed. MMSE estimators under speech presence uncertainty are then derived, and suitable estimators of the speech statistical parameters are proposed. The speech Laplacian factor is estimated by a new decision-directed method. Simulation results show that the proposed algorithm yields less residual noise and better speech quality than the Gaussian-based speech enhancement algorithms proposed in recent years.
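As a rough illustration of the estimator described in this abstract, the sketch below approximates the MMSE estimate E[s | y] of a clean coefficient s from a noisy coefficient y = s + n by brute-force numerical integration, assuming a zero-mean Laplacian prior on s and zero-mean Gaussian noise. The function name `mmse_laplacian`, its parameters, and the integration grid are illustrative assumptions; the paper derives its own estimators, which are not reproduced here.

```python
import math

def mmse_laplacian(y, speech_scale, noise_var, steps=4001):
    """Approximate E[s | y] for y = s + n on a uniform grid.

    Assumptions (not taken from the paper): s ~ Laplace(0, speech_scale),
    n ~ N(0, noise_var), and a simple Riemann sum over a wide interval.
    """
    limit = abs(y) + 10.0 * (speech_scale + math.sqrt(noise_var))
    h = 2.0 * limit / (steps - 1)
    num = 0.0
    den = 0.0
    for i in range(steps):
        s = -limit + i * h
        # Unnormalized posterior weight: Gaussian likelihood times Laplacian prior.
        w = math.exp(-(y - s) ** 2 / (2.0 * noise_var) - abs(s) / speech_scale)
        num += s * w
        den += w
    # The grid step and the normalizing constants cancel in the ratio.
    return num / den

if __name__ == "__main__":
    print(mmse_laplacian(1.0, speech_scale=1.0, noise_var=1.0))
```

The estimate is an odd function of y and shrinks the noisy coefficient toward zero, which is the qualitative behavior expected from a sparse (super-Gaussian) prior.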
Funding: Project (60763001) supported by the National Natural Science Foundation of China; Projects (2009GZS0027, 2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China.
Abstract: To overcome the defects of the classical hidden Markov model (HMM), the Markov family model (MFM), a new statistical model, was proposed. The Markov family model was applied to speech recognition and natural language processing. Speaker-independent continuous speech recognition experiments and part-of-speech tagging experiments show that the Markov family model performs better than the hidden Markov model. The precision is improved from 94.642% to 96.214% in the part-of-speech tagging experiments, and the work rate is reduced by 11.9% in the speech recognition experiments with respect to the HMM baseline system.
Abstract: Modification of the time scale and pitch scale of Chinese syllables based on a sinusoidal model is presented in this paper. First, the short-term speech is decomposed into a sum of sinusoidal waves of different magnitudes and phases. Then the vocal tract system and the excitation are obtained using a homomorphic technique. Finally, speech with the desired time scale and pitch scale is obtained by changing the frequency and phase of the excitation while the parameters of the vocal tract system are changed accordingly. The results show that the adjustable range of the pitch and time scales is large with this algorithm and that it is suitable for the analysis and synthesis of Chinese speech.
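The sinusoidal decomposition mentioned in this abstract can be sketched as follows: a frame is resynthesized as a sum of cosines, and a pitch change is imitated by multiplying every component frequency by a factor. This is a deliberately simplified sketch. The function `synthesize_frame` and its `pitch_scale` parameter are hypothetical, and the sketch does not separate the vocal tract from the excitation or adjust the vocal tract parameters, which the paper does.

```python
import math

def synthesize_frame(amps, freqs_hz, phases, sample_rate, n_samples, pitch_scale=1.0):
    """Return n_samples of a sum of sinusoids.

    pitch_scale multiplies every frequency; this is an illustrative
    simplification, not the homomorphic method described in the abstract.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = sum(
            a * math.cos(2.0 * math.pi * f * pitch_scale * t + p)
            for a, f, p in zip(amps, freqs_hz, phases)
        )
        frame.append(sample)
    return frame

if __name__ == "__main__":
    # Two hypothetical components at 100 Hz and 200 Hz, raised by 20%.
    frame = synthesize_frame([1.0, 0.5], [100.0, 200.0], [0.0, 0.0], 8000, 160, pitch_scale=1.2)
    print(len(frame), frame[0])
```

Scaling the frequencies alone also shifts the spectral envelope, which is why a practical system re-estimates the component amplitudes from the vocal tract response after the shift.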
Abstract: A sinusoidal representation of speech and a cochlear model are used to extract speech parameters in this paper, and a speech analysis/synthesis system controlled by the auditory spectrum is developed with the model. Computer simulation shows that speech can be synthesized with only 12 parameters per frame on average. The method has the advantages of few parameters, low complexity, and high performance in speech representation. The synthetic speech has high intelligibility.
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60503071), the 973 National Basic Research Program of China (Grant No. 2004CB318102), and the Postdoctoral Science Foundation of China (Grant No. 20070420275).
Abstract: The prosody model directly affects the naturalness of synthesized speech. To address the difficulty of generating the pitch contour in the prosody model, two pitch models, namely the corpus-based pitch model and the pitch pattern model, are studied in depth in this paper. The key problems in the corpus-based model are the calculation of the distance and the search for the optimal path with a dynamic programming algorithm. For the pitch pattern model, parameters such as pitch pattern, pitch average, and pitch range are used to describe the pitch contour, and six pitch patterns are presented. For generating the pitch contour, the pitch pattern model is more flexible than the corpus-based model. Both models are integrated into a real TTS system, and the MOS results for synthesized Mandarin speech show that the pitch pattern model is better than the corpus-based pitch model.
Funding: Supported by the Independent Innovation Foundation of Shandong University (No. 2009JC004) and the Natural Science Foundation of Shandong Province (No. Y2007G31).
Abstract: An autoregressive moving average (ARMA) model for whispered speech is proposed. Compared with normal speech, whispered speech has no fundamental frequency because the glottis is semi-open and turbulent flow is created, and formant shifting exists in the lower frequency region due to the narrowing of the tract in the false vocal fold regions and weak acoustic coupling with the subglottal system. Analysis shows that the effect of the subglottal system is to introduce additional pole-zero pairs into the vocal tract transfer function. Theoretically, a method based on an ARMA process is superior to one based on an AR process in the spectral analysis of whispered speech. Two methods, the least squared modified Yule-Walker likelihood estimate (LSMY) algorithm and the frequency-domain Steiglitz-McBride (FDSM) algorithm, are applied to the ARMA model for whispered speech. The performance evaluation shows that the ARMA model is much more appropriate for representing whispered speech than the AR model, and that the FDSM algorithm provides a more accurate estimate of the whispered speech spectral envelope than the LSMY algorithm with higher computational complexity.
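To make the AR versus ARMA distinction in this abstract concrete, the sketch below implements the standard ARMA difference equation y[n] = sum_k b_k x[n-k] - sum_k a_k y[n-k]. The moving-average coefficients b_k contribute zeros and the autoregressive coefficients a_k contribute poles, so an AR model is the special case with a single b_0. The function name `arma_filter` and the example coefficients are hypothetical; the LSMY and FDSM estimation algorithms are not shown.

```python
def arma_filter(ma, ar, x):
    """Filter x with an ARMA model.

    ma: [b_0, ..., b_q] (zeros); ar: [a_1, ..., a_p] (poles), with the
    leading denominator coefficient a_0 = 1 left implicit.
    """
    y = []
    for n in range(len(x)):
        # Moving-average (feed-forward) part.
        acc = sum(ma[k] * x[n - k] for k in range(len(ma)) if n - k >= 0)
        # Autoregressive (feedback) part.
        acc -= sum(ar[k - 1] * y[n - k] for k in range(1, len(ar) + 1) if n - k >= 0)
        y.append(acc)
    return y

if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(arma_filter([1.0], [-0.5], impulse))       # AR only: one pole
    print(arma_filter([1.0, 0.5], [-0.5], impulse))  # ARMA: one pole and one zero
```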
Abstract: Automatic speech recognition (ASR) is vital for very low-resource languages because it helps mitigate the risk of extinction. Chaha is one such low-resource language: it suffers from insufficient resources, and some of its phonological, morphological, and orthographic features make ASR development challenging. With these challenges in mind, this study is the first endeavor to analyze the characteristics of the language, prepare a speech corpus, and develop different ASR systems. A small 3-hour read speech corpus was prepared and transcribed. Different basic and rounded phone unit-based speech recognizers were explored using multilingual deep neural network (DNN) modeling methods. The experimental results demonstrated that all the basic phone and rounded phone unit-based multilingual models outperformed the corresponding unilingual models, with relative performance improvements of 5.47% to 19.87% and 5.74% to 16.77%, respectively. The rounded phone unit-based multilingual models outperformed the equivalent basic phone unit-based models with relative performance improvements of 0.95% to 4.98%. Overall, we found that multilingual DNN modeling methods are highly effective for developing Chaha speech recognizers. Both the basic and rounded phone acoustic units are suitable for building Chaha ASR systems. However, the rounded phone unit-based models are superior in performance and faster in recognition speed than the corresponding basic phone unit-based models. Hence, the rounded phone units are the most suitable acoustic units for developing Chaha ASR systems.
Abstract: This paper presents the recognition of spoken sentences in “Baoule”, a language of Côte d’Ivoire. Several formalisms allow the modelling of an automatic speech recognition system. The one we used to build our system is based on discrete hidden Markov models (HMMs). Our goal in this article is to present a system for the recognition of Baoule words. We present three classical problems and develop different algorithms able to solve them. We then run these algorithms on concrete examples.
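The abstract does not name the three classical problems. In the HMM literature they are usually evaluation, decoding, and training, and that reading is an assumption here. Under that assumption, the sketch below shows the forward algorithm, a common solution to the evaluation problem for a discrete HMM. The function name `forward_likelihood` and the two-state model values are hypothetical and are not taken from the paper.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations | model) for a discrete HMM.

    init[i]: initial probability of state i
    trans[i][j]: probability of moving from state i to state j
    emit[i][o]: probability that state i emits symbol o
    """
    n_states = len(init)
    # Initialization with the first observed symbol.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n_states)]
    # Induction over the remaining symbols.
    for o in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][o]
            for j in range(n_states)
        ]
    # Termination: sum over the final states.
    return sum(alpha)

if __name__ == "__main__":
    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.9, 0.1], [0.2, 0.8]]
    print(forward_likelihood(init, trans, emit, [0, 1, 0]))
```

In an isolated-word recognizer built this way, one model is typically trained per word and the word whose model gives the highest likelihood is selected.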