Funding: the National High-Tech Research and Development Program of China (No. 2001AA114071)
Abstract: This work demonstrates the use of the nonlinear time-frequency distribution (NLTFD) of a discrete time energy operator (DTEO), based on amplitude modulation-frequency modulation demodulation techniques, as a feature in speech recognition. The duration distribution based hidden Markov model in a speaker-independent, large-vocabulary Mandarin speech recognition system was reconstructed from the feature vectors in the front-end detection stage. The goal was to improve the performance of the existing system by combining new features with the baseline feature vector. This paper also deals with errors associated with using a pre-emphasis filter in the front-end processing of the present scheme, which increases the noise energy at high frequencies above 4 kHz and in some cases degrades the recognition accuracy. The experimental results show that eliminating the pre-emphasis filter from the pre-processing stage and using NLTFD with compensated DTEO combined with Mel frequency cepstrum components gives a 21.95% reduction in the relative error rate compared to the conventional technique, with 25 candidates used in the test.
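As a rough illustration of the energy operator named in this abstract, the sketch below assumes the DTEO is the standard Teager-Kaiser form, psi[n] = x[n]^2 - x[n-1]*x[n+1]. The abstract does not give the exact definition or the compensation it applies, and the function name and test tone here are hypothetical. For a pure tone A*cos(omega*n), this operator returns the constant A^2*sin^2(omega), which is the property that AM-FM demodulation schemes build on.

```python
import math

def teager_energy(x):
    """Teager-Kaiser energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples.

    Assumed form of the DTEO; the abstract does not spell out its definition.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Hypothetical test tone: amplitude A, digital frequency omega (radians/sample).
A, omega = 2.0, 0.3
tone = [A * math.cos(omega * n) for n in range(64)]

energy = teager_energy(tone)
expected = (A * math.sin(omega)) ** 2  # closed form for a pure tone
```

Every value in `energy` matches `expected` up to rounding, so the output tracks both the amplitude and the frequency of the tone rather than its squared amplitude alone.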
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 200/AA/14)
Abstract: This work describes an improved feature extraction algorithm that extracts the peripheral features of a point x(t_i, f_j) using a nonlinear algorithm to compute the nonlinear time spectrum (NL-TS) pattern. The algorithm observes n×n neighborhoods of the point in all directions, and then incorporates the peripheral features into the Mel frequency cepstrum components (MFCCs)-based feature extractor of the Tsinghua electronic engineering speech processing (THEESP) system for Mandarin automatic speech recognition (MASR), as replacements for the dynamic features in different feature combinations. In this algorithm, the orthogonal bases are extracted directly from the speech data using the discrete cosine transform (DCT) with 3×3 blocks on an NL-TS pattern as the peripheral features. The new primal bases are then selected and simplified in the form of the ?dp_t-operator in the time direction and the ?dp_f-operator in the frequency direction. The algorithm achieves a 23.29% improvement in the relative error rate compared with the standard MFCC feature set and the dynamic features, in tests using THEESP with the duration distribution-based hidden Markov model (DDBHMM)-based MASR system.
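To make the 3×3 DCT step concrete, here is a minimal sketch assuming an orthonormal DCT-II applied separably to one 3×3 neighborhood of the time-spectrum pattern. The block layout (rows as frequency bins, columns as frames), the helper names, and the sample values are assumptions for illustration only; the abstract does not define the operators it finally selects, and this sketch does not reproduce them.

```python
import math

N = 3  # block size named in the abstract

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def dct2(block):
    """2-D DCT of a square block, computed as C * block * C^T."""
    n = len(block)
    c = dct_matrix(n)
    # Transform along the frequency axis (rows), then along the time axis (columns).
    tmp = [[sum(c[u][f] * block[f][t] for f in range(n)) for t in range(n)]
           for u in range(n)]
    return [[sum(tmp[u][t] * c[v][t] for t in range(n)) for v in range(n)]
            for u in range(n)]

# Hypothetical neighborhood: rises linearly along time, constant along frequency.
ramp = [[float(t) for t in range(N)] for _ in range(N)]
coeffs = dct2(ramp)
```

For this ramp only `coeffs[0][0]` (the block average term) and `coeffs[0][1]` (first-order variation along time) are nonzero, which shows how a low-order basis can act like a directional difference operator.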
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 863-306-ZD03-01-2)
Abstract: In this paper we address the problem of audio-visual speech recognition in the framework of the multi-stream hidden Markov model. Stream weight training based on the minimum classification error criterion is discussed for use in large vocabulary continuous speech recognition (LVCSR). We present the lattice rescoring and Viterbi approaches for calculating the loss function of continuous speech. The experimental results show that, in the case of clean audio, state-based stream weights trained by a Viterbi approach yield a 36.1% relative reduction in word error rate compared to an audio-only speech recognition system. Further experimental results demonstrate that our audio-visual LVCSR system is significantly more robust in noisy environments.
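The role of a stream weight can be sketched as follows, assuming the common multi-stream formulation in which a state's score is a weighted sum of the audio and visual log-likelihoods, with the two weights summing to one. The function names and the scores are hypothetical, and the minimum classification error training of the weights described in the abstract is not shown.

```python
def combined_log_likelihood(log_audio, log_visual, audio_weight):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    Assumes the visual weight is 1 - audio_weight (a common convention).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual

# Hypothetical scores per state: (log_audio, log_visual).
# The audio stream favors state "a"; the visual stream favors state "b".
scores = {"a": (-2.0, -9.0), "b": (-6.0, -3.0)}

def best_state(audio_weight):
    """Return the state with the highest combined score for a given weight."""
    return max(scores,
               key=lambda s: combined_log_likelihood(*scores[s], audio_weight))
```

With `audio_weight=0.9` the audio evidence dominates and `best_state` picks "a"; with `audio_weight=0.2` the visual evidence wins and it picks "b", which is why weights suited to the noise condition matter.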
Funding: The work was supported in part by the National Natural Science Foundation of China (Grant No. 90920302), the National Key Basic Research Program of China (No. 2009CB825404), the HGJ Grant (No. 2011ZX01042-001-001), a research program from Microsoft China, and by a GRF grant from the Research Grant Council of Hong Kong SAR (CUHK 4180/10E). Lei XU is also supported by the Chang Jiang Scholars Program, Chinese Ministry of Education, for a Chang Jiang Chair Professorship at Peking University.
Abstract: This paper presents a new discriminative approach for training the Gaussian mixture models (GMMs) of a hidden Markov model (HMM)-based acoustic model in a large vocabulary continuous speech recognition (LVCSR) system. The approach embeds a rival penalized competitive learning (RPCL) mechanism at the level of hidden Markov states. For every input, the correct identity state, called the winner and obtained by Viterbi forced alignment, is enhanced to describe this input, while its most competitive rival is penalized by de-learning, which makes the GMM-based states more discriminative. Because it avoids the extensive computing burden that typical discriminative learning methods incur for one-pass recognition of the training set, the new approach saves computing costs considerably. Experiments show that the proposed method converges well and performs better than the classical maximum likelihood estimation (MLE)-based method. Compared with two conventional discriminative methods, the proposed method demonstrates improved generalization ability, especially when the test set is not well matched with the training set.
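The winner/rival idea can be illustrated with a minimal sketch of one RPCL step, reduced here to a single mean vector per state. This is a simplification: the abstract applies the mechanism to GMM-based states, with the winner taken from Viterbi forced alignment, whereas the state names, learning rates, and input frame below are hypothetical and the winner and rival are simply given.

```python
def rpcl_update(means, x, winner, rival, learn_rate=0.1, delearn_rate=0.01):
    """One RPCL step: pull the winner's mean toward x, push the rival's away.

    Simplified to one mean per state; rates are illustrative assumptions.
    """
    updated = {state: list(m) for state, m in means.items()}
    updated[winner] = [m + learn_rate * (xi - m)
                       for m, xi in zip(means[winner], x)]
    updated[rival] = [m - delearn_rate * (xi - m)
                      for m, xi in zip(means[rival], x)]
    return updated

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Hypothetical two-state example with a 2-dimensional input frame.
means = {"s1": [0.0, 0.0], "s2": [1.0, 1.0]}
frame = [0.4, 0.2]
new_means = rpcl_update(means, frame, winner="s1", rival="s2")
```

After the step the winner "s1" lies closer to `frame` and the rival "s2" lies slightly farther from it, which is the de-learning effect that separates competing states.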