Abstract: In recent years, the accuracy of speech recognition (SR) has been one of the most active areas of research. Although SR systems work reasonably well in quiet conditions, they still suffer severe performance degradation in noisy conditions or over distorted channels. It is therefore necessary to search for more robust feature extraction methods that give better performance in adverse conditions. This paper investigates the performance of conventional and new hybrid speech feature extraction algorithms, namely Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction Coding Coefficients (LPCC), perceptual linear prediction (PLP), and RASTA-PLP, in noisy conditions using a multivariate Hidden Markov Model (HMM) classifier. The behavior of the proposed system is evaluated on the TIDIGIT human voice corpus, recorded from 208 different adult speakers, in both the training and testing processes. The theoretical basis for the speech processing and classifier procedures is presented, and the recognition results are reported as word recognition rates.
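For readers who want a concrete baseline in the same spirit, the sketch below trains one Gaussian HMM per word on MFCC features and recognizes a test utterance by maximum log-likelihood. It is not the paper's system: the file lists, sampling rate, number of HMM states, and the use of librosa and hmmlearn are assumptions made for illustration.

```python
# Minimal MFCC + per-word Gaussian HMM sketch (illustrative, not the paper's system).
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a waveform and return frame-wise MFCC vectors (frames x coeffs)."""
    y, sr = librosa.load(path, sr=sr)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return feats.T  # hmmlearn expects (n_frames, n_features)

def train_word_models(train_files, n_states=5):
    """train_files: dict mapping a word label to a list of audio paths (placeholder)."""
    models = {}
    for word, paths in train_files.items():
        feats = [mfcc_features(p) for p in paths]
        X = np.vstack(feats)
        lengths = [f.shape[0] for f in feats]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(path, models):
    """Pick the word whose HMM gives the highest log-likelihood for the utterance."""
    feats = mfcc_features(path)
    return max(models, key=lambda w: models[w].score(feats))
```

Swapping the feature function for LPCC, PLP, or RASTA-PLP features leaves the classifier side of such a pipeline unchanged.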
Funding: This work was supported by the National Natural Science Foundation of China (69972046) and the NSF of Zhejiang Province (698076).
Abstract: A method to synthesize formant-targeted sounds based on a speech production model and a Reflection-Type Line Analog (RTLA) articulatory synthesis model is presented. The synthesis model is implemented with a scattering process derived from an RTLA of the vocal tract system, according to the acoustic mechanism of speech production. The vocal-tract area function that controls the synthesis model is derived from the first three formant trajectories using the inverse solution of speech production. The proposed method not only gives good naturalness and dynamic smoothness but also makes it easy and flexible to control or modify speech timbre. Furthermore, it needs fewer control parameters and a very low parameter update rate.
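As a point of reference for the scattering process described above, here is a minimal Kelly-Lochbaum-style sketch of a reflection-type line analog of the vocal tract. It is not the paper's implementation: the area function, glottal source, and boundary reflection coefficients are assumed inputs, and the derivation of the area function from formant trajectories is omitted.

```python
import numpy as np

def kelly_lochbaum_synth(area, source, r_glottis=0.99, r_lips=-0.9):
    """One-sample-per-section scattering lattice driven by a glottal source.
    area:   vocal-tract area function, ordered glottis -> lips (assumed given)
    source: excitation injected at the glottal end (e.g. a pulse train)
    Returns the pressure wave radiated at the lip end."""
    area = np.asarray(area, dtype=float)
    N = len(area)
    # Junction reflection coefficients from adjacent section areas
    # (pressure-wave convention, acoustic impedance proportional to 1/A).
    r = (area[:-1] - area[1:]) / (area[:-1] + area[1:])
    fwd = np.zeros(N)   # right-going partial pressure waves, one per section
    bwd = np.zeros(N)   # left-going partial pressure waves
    out = np.zeros(len(source))
    for n, s in enumerate(source):
        fwd_new = np.empty(N)
        bwd_new = np.empty(N)
        # inject the source and reflect the returning wave at the glottis
        fwd_new[0] = s + r_glottis * bwd[0]
        # scattering at each junction between section k and k + 1
        for k in range(N - 1):
            fwd_new[k + 1] = (1 + r[k]) * fwd[k] - r[k] * bwd[k + 1]
            bwd_new[k] = r[k] * fwd[k] + (1 - r[k]) * bwd[k + 1]
        # reflection at the lips; the unreflected part is radiated
        bwd_new[N - 1] = r_lips * fwd[N - 1]
        out[n] = (1 + r_lips) * fwd[N - 1]
        fwd, bwd = fwd_new, bwd_new
    return out
```

Driving such a lattice with a simple pulse train and a uniform area function yields a neutral, schwa-like output; time-varying area functions give the dynamic smoothness the abstract refers to.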
Abstract: The experiment presented in this research targets the 'positional' stage of the 'modular' model of speech production originally proposed by Levelt (1989) and Bock & Levelt (1994), in which selected lemmas are inserted into syntactic frames. The results suggest a difference between L1 and L2 English speakers at the positional stage. While this might suggest that the speech planning process differs between native and non-native speakers, an alternative view is also proposed: the observed differences result from differences in the way linguistic forms are stored, rather than from a fundamental difference in the way speech is planned. This result indicates that the main verb, copula be, and local dependency effects are the three elements that affect the realization of English subject-verb agreement, and it helps locate the phase in which L2 subject-verb agreement errors occur.
Abstract: Speech production and acquisition is a complex cognitive process involving many regions of the brain. The process comprises a hierarchical structure that extends from representations organizing sentences or phrases according to syntax and semantics all the way down to phoneme production. The DIVA (directions into velocities of articulators) model is a mathematical model that describes the processing involved in speech production and acquisition; it is also an adaptive network model used to control the movements of a simulated vocal tract in order to produce words, syllables, or phonemes. Among today's biologically plausible neural network models of speech production and acquisition, the DIVA model is the most thoroughly defined and tested, and it is the only one that applies a pseudo-inverse control scheme. This paper introduces a null-space-based pseudo-inverse algorithm to improve the pseudo-inverse control solution in the DIVA model, so that the corresponding parameters of the DIVA model are obtained more accurately and the robustness of the model is improved.
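The paper's improved algorithm itself is not reproduced here; the following is a generic sketch of the null-space pseudo-inverse control law the abstract refers to, q_dot = pinv(J) x_dot + (I - pinv(J) J) z, with an illustrative secondary objective that pulls the articulators toward a hypothetical neutral posture. The names q_neutral and alpha are placeholders, not DIVA parameters.

```python
import numpy as np

def nullspace_pinv_step(J, x_dot, q, q_neutral, alpha=0.1):
    """One velocity-control update in the spirit of a pseudo-inverse scheme.
    J:         Jacobian of the articulatory-to-auditory map at the current posture
    x_dot:     desired velocity in auditory/formant space
    q:         current articulator configuration
    q_neutral: hypothetical secondary target (e.g. a neutral posture)
    Returns the articulator velocity: the minimum-norm pseudo-inverse solution
    plus a null-space term that pursues the secondary target without
    disturbing the auditory trajectory."""
    J_pinv = np.linalg.pinv(J)
    primary = J_pinv @ x_dot                   # tracks the auditory target
    N = np.eye(J.shape[1]) - J_pinv @ J        # projector onto the null space of J
    secondary = N @ (alpha * (q_neutral - q))  # redundancy resolution
    return primary + secondary
```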
Funding: Supported in part by Promoting Science and Technology of the Japan Ministry of Education, Culture, Sports, Science and Technology, and by the SCOPE program of the Ministry of Internal Affairs and Communications (MIC), Japan (No. 071705001).
Abstract: A three-dimensional (3-D) physiological articulatory model was developed to account for the biomechanical properties of the speech organs in speech production. Using the model to investigate the mechanism of speech production requires an efficient control module that estimates the muscle activation patterns used to manipulate the 3-D physiological articulatory model according to a desired articulatory posture. For this purpose, a feedforward control strategy was developed that maps the articulatory target to the corresponding muscle activation pattern via the intrinsic representation of vowel articulation. In this process, the articulatory postures are first mapped to their intrinsic representations; the postures are then clustered in the intrinsic representation space, and for each cluster a nonlinear function mapping the intrinsic representation of vowel articulation to the muscle activation pattern is approximated using general regression neural networks (GRNNs). The results show that the feedforward control module is able to manipulate the 3-D physiological articulatory model for vowel production with high accuracy, both acoustically and articulatorily.
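As an illustration of the per-cluster mapping described above, here is a minimal general regression neural network in its Nadaraya-Watson form: it stores the training pairs and predicts a kernel-weighted average of the stored targets. The intrinsic-representation inputs and muscle-activation targets are placeholders, and the smoothing parameter sigma is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

class GRNN:
    """Minimal general regression neural network (Nadaraya-Watson kernel regression)."""
    def __init__(self, sigma=0.1):
        self.sigma = sigma

    def fit(self, X, Y):
        # X: (n_samples, n_inputs)  e.g. intrinsic representations of vowel postures
        # Y: (n_samples, n_outputs) e.g. muscle activation patterns
        self.X, self.Y = np.asarray(X, float), np.asarray(Y, float)
        return self

    def predict(self, x):
        # Gaussian kernel weights on squared distances to the stored patterns
        d2 = np.sum((self.X - np.asarray(x, float)) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))
        w = w / (w.sum() + 1e-12)   # normalize the pattern-layer weights
        return w @ self.Y           # kernel-weighted average of stored targets
```

One GRNN per posture cluster, each fit only on that cluster's samples, mirrors the clustered mapping strategy the abstract describes.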
Funding: The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Government of India, for financial support through a scholarship for Palli Padmini during the research work, under the Visvesvaraya Ph.D. Scheme for Electronics and IT.
Abstract: The present system experimentally demonstrates the synthesis of syllables and words from tongue manoeuvres in multiple languages, captured by only four oral sensors. For an experimental demonstration of the system used in the oral cavity, a prototype tooth model was used. Based on the principle developed in a previous publication by the author(s), the proposed system has been implemented using oral-cavity (tongue, teeth, and lips) features alone, without the glottis and the larynx. The positions of the sensors in the proposed system were optimized based on articulatory (oral cavity) gestures estimated by simulating the mechanism of human speech. The system has been tested on all letters of the English alphabet and several words with sensor-based input, along with an experimental demonstration of the developed algorithm, with limit switches, a potentiometer, and flex sensors emulating the tongue in an artificial oral cavity. The system produces the sounds of vowels, consonants, and words in English, along with the pronunciation of the meanings of their translations in four major Indian languages, all from oral-cavity mapping. The experimental setup also caters to gender mapping of the voice. The sound produced by the hardware has been validated by a perceptual test in which listeners verified the gender and the word of the speech sample, with ~98% and ~95% accuracy, respectively. Such a model may be useful for interpreting speech for those who are speech-disabled because of accidents, neurological disorders, spinal cord injury, or larynx disorders.
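A hypothetical sketch of the lookup step such a system implies is given below: quantized readings from the four oral sensors are mapped to a phoneme label, which then selects a pre-synthesized, gender-specific clip. The sensor encoding, the table entries, and the clip-naming scheme are illustrative only and are not the authors' calibration.

```python
# Hypothetical sensor-state -> phoneme lookup; states and entries are illustrative.
from typing import Dict, List, Tuple

SensorState = Tuple[int, int, int, int]   # e.g. (limit1, limit2, flex_level, pot_level)

PHONEME_TABLE: Dict[SensorState, str] = {
    (1, 0, 0, 2): "a",   # open jaw, tongue low (example entry)
    (0, 1, 1, 1): "i",   # tongue high and front (example entry)
    (0, 0, 2, 0): "u",   # tongue high and back, lips rounded (example entry)
}

def decode_phoneme(state: SensorState, table=PHONEME_TABLE, default="sil") -> str:
    """Return the phoneme label for a quantized sensor state, or silence."""
    return table.get(state, default)

def synthesize(states: List[SensorState], gender: str = "female") -> List[str]:
    """Map a sequence of sensor states to gender-specific clip names (playback omitted)."""
    return [f"{gender}_{decode_phoneme(s)}.wav" for s in states]
```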