Abstract: An algorithm involving Mel-Frequency Cepstral Coefficients (MFCCs) is provided to perform signal feature extraction for the task of speaker accent recognition. Different classifiers are then compared based on the MFCC feature. For each signal, the mean vector of the MFCC matrix is used as the input vector for pattern recognition. A sample of 330 signals, containing 165 US voices and 165 non-US voices, is analyzed. Across 500 cross-validation runs, k-nearest neighbors yields the highest average test accuracy and requires the least computation time.
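To make the pipeline concrete, here is a minimal sketch of the described approach: each recording is reduced to the mean of its MFCC matrix, and a k-nearest-neighbors classifier is scored by cross-validation. The choice of 12 coefficients, k = 5, and the placeholder feature matrix (standing in for the 330 real recordings) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: mean MFCC vector per recording + k-nearest neighbors.
# n_mfcc=12 and k=5 are illustrative assumptions.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mean_mfcc(path, n_mfcc=12):
    """Load a recording and reduce its MFCC matrix to its column-wise mean."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # shape (n_mfcc,)

# With real data: X = np.stack([mean_mfcc(p) for p in wav_paths]).
# A random placeholder matrix keeps the sketch runnable without audio files.
X = np.random.default_rng(0).normal(size=(330, 12))  # 330 recordings, 12 features
y = np.array([1] * 165 + [0] * 165)                  # 1 = US voice, 0 = non-US voice

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)  # the paper averages over 500 random splits
print("mean test accuracy:", scores.mean())
```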
Abstract: Speech recognition systems have become a unique family of human-computer interaction (HCI). Speech is one of the most naturally developed human abilities, and speech signal processing opens up a transparent and hands-free computing experience. This paper presents a retrospective yet modern approach to the world of speech recognition systems. The development of ASR (Automatic Speech Recognition) has seen quite a few milestones and breakthrough technologies, which are highlighted in this paper. A step-by-step rundown of the fundamental stages in developing speech recognition systems is presented, along with a brief discussion of various modern-day developments and applications in this domain. This review aims to summarize the field and provide a starting point for those beginning in the vast field of speech signal processing. Since speech recognition has vast potential in industries such as telecommunication, emotion recognition, and healthcare, this review should be helpful to researchers who aim to explore further applications that society can quickly adopt in the coming years.
Abstract: Hearing-impaired individuals may be unable to identify environmental sounds because of the noise around them, yet very little research has been conducted in this domain. The aim of this study is therefore to categorize sounds generated in the environment so that hearing-impaired individuals can distinguish the sound categories. To that end, we first define nine sound classes that typically occur in the environment: air conditioner, car horn, children playing, dog bark, drilling, engine idling, jackhammer, siren, and street music. We then record 100 sound samples from each category and extract features from each sound using Mel-Frequency Cepstral Coefficients (MFCC). The training dataset is built from this set of features together with the class variable, the sound category. Because sound classification is a complex task, we use two deep learning techniques, a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN), to train classification models. The models are tested on a separate test set, and their performance is evaluated using precision, recall, and F1-score. The results show that the CNN model outperforms the MLP; however, the MLP also achieves decent accuracy in classifying unknown environmental sounds.
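A small sketch of what the CNN branch could look like, treating each fixed-size MFCC matrix as a one-channel image classified into the nine categories. The architecture, input shape, and layer sizes below are illustrative assumptions, not the authors' exact model.

```python
# Sketch CNN over MFCC "images" for the nine environmental sound classes.
# Input shape (40 coefficients x 173 frames) and all layer sizes are assumed.
import torch
import torch.nn as nn

NUM_CLASSES = 9  # air conditioner, car horn, ..., street music

class SoundCNN(nn.Module):
    def __init__(self, n_mfcc=40, n_frames=173):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # two 2x2 poolings shrink each spatial dimension by a factor of 4
        self.classifier = nn.Linear(32 * (n_mfcc // 4) * (n_frames // 4), NUM_CLASSES)

    def forward(self, x):            # x: (batch, 1, n_mfcc, n_frames)
        z = self.features(x)
        return self.classifier(z.flatten(1))

# one forward pass on a dummy MFCC batch
logits = SoundCNN()(torch.randn(8, 1, 40, 173))
print(logits.shape)  # torch.Size([8, 9])
```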
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60903186) and the Shanghai Leading Academic Discipline Project (Grant No. J50104).
Abstract: This paper proposes a new phase feature derived from the formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using the Hilbert transform (HT), the formant characteristics can be represented by instantaneous frequency (IF) and instantaneous bandwidth, together called the formant instantaneous characteristics (FIC). To explore the importance of FIC in both SR and SI, this paper proposes different features derived from FIC for SR and SI systems. When these new features are combined with conventional parameters, a higher identification rate can be achieved than with Mel-frequency cepstral coefficient (MFCC) parameters alone. The experimental results show that the new features are effective characteristic parameters and can be treated as a complement to conventional parameters for SR and SI.
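The FIC idea rests on the analytic signal. Below is a minimal sketch, assuming a synthetic 800 Hz tone stands in for an isolated formant band, of how instantaneous frequency and one common estimate of instantaneous bandwidth fall out of the Hilbert transform.

```python
# Sketch: instantaneous frequency/bandwidth of a band-limited signal via the
# Hilbert transform. The 800 Hz tone and 16 kHz sample rate are illustrative.
import numpy as np
from scipy.signal import hilbert

fs = 16000                                     # assumed sample rate
t = np.arange(0, 0.05, 1 / fs)
formant = np.cos(2 * np.pi * 800 * t)          # synthetic stand-in for one formant band

analytic = hilbert(formant)                    # x(t) + j * H{x(t)}
envelope = np.abs(analytic)                    # instantaneous amplitude a(t)
phase = np.unwrap(np.angle(analytic))
inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency in Hz
inst_bw = np.abs(np.diff(np.log(envelope))) * fs / (2 * np.pi)  # one common bandwidth estimate

print(inst_freq[50:55])                        # ~800 Hz away from the signal edges
```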
Funding: Supported by the National Natural Science Foundation of China (No. 6007201).
Abstract: The Mel-frequency cepstral coefficient (MFCC) is the most widely used feature in speech and speaker recognition. However, MFCC is very sensitive to noise interference, which tends to drastically degrade recognition performance because of the mismatch between training and testing conditions. In this paper, the logarithmic transformation in standard MFCC analysis is replaced by a combined function to reduce this noise sensitivity. The proposed feature extraction process is also combined with speech enhancement methods, such as spectral subtraction and median filtering, to further suppress the noise. Experiments show that the proposed robust MFCC-based feature significantly reduces the recognition error rate over a wide range of signal-to-noise ratios.
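A hedged sketch of the two ingredients named in the abstract: a basic magnitude spectral-subtraction front end, and an MFCC variant whose log compression is swapped for a different nonlinearity. The abstract does not specify the paper's "combined function", so the power-law (root) compression below is a common stand-in, and estimating the noise profile from the first frames is an illustrative choice.

```python
# Sketch: (1) magnitude spectral subtraction with a noise profile estimated
# from the first frames (assumed speech-free), and (2) MFCC-style features
# with power-law compression in place of the log. gamma=0.1 is illustrative.
import numpy as np
import scipy.fftpack
import librosa

def spectral_subtraction(y, noise_frames=10):
    S = librosa.stft(y)                                        # complex spectrogram
    mag, phase = np.abs(S), np.angle(S)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise magnitude profile
    clean = np.maximum(mag - noise, 0.0)                       # subtract, floor at zero
    return librosa.istft(clean * np.exp(1j * phase))           # resynthesize with original phase

def root_compressed_mfcc(y, sr, n_mfcc=13, gamma=0.1):
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    compressed = mel ** gamma                                  # power law replaces log()
    return scipy.fftpack.dct(compressed, axis=0, norm="ortho")[:n_mfcc]
```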
Abstract: Audiovisual speech recognition is an emerging research topic. Lipreading is the recognition of what someone is saying from visual information, primarily lip movements. In this study, we created a custom dataset for Indian English and divided the task into three main parts: (1) audio recognition, (2) visual feature extraction, and (3) combined audio and visual recognition. Audio features were extracted using Mel-frequency cepstral coefficients, and classification was performed using a one-dimensional convolutional neural network. Visual feature extraction uses Dlib, and visual speech is then classified using a long short-term memory (LSTM) recurrent neural network. Finally, integration was performed using a deep convolutional network. Audio speech in Indian English was recognized with training and testing accuracies of 93.67% and 91.53%, respectively, after 200 epochs. For visual speech recognition on the Indian English dataset, the training accuracy was 77.48% and the test accuracy 76.19% after 60 epochs. After integration, the training and testing accuracies of audiovisual speech recognition on the Indian English dataset were 94.67% and 91.75%, respectively.
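As a sketch of the visual front end, the snippet below uses Dlib's standard 68-point landmark model to keep the mouth points (indices 48 to 67) as a per-frame lip feature vector that an LSTM could consume as a sequence. The pretrained predictor file is assumed to be available locally; the rest of the pipeline (frame extraction, normalization, the LSTM itself) is omitted.

```python
# Sketch: per-frame lip landmark features via Dlib's 68-point face model.
# The predictor .dat file is the standard pretrained model, assumed present.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_features(gray_frame):
    """Return a flat (40,) vector of the 20 mouth landmark (x, y) pairs, or None."""
    faces = detector(gray_frame)
    if not faces:
        return None
    shape = predictor(gray_frame, faces[0])
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]  # mouth region
    return np.asarray(pts, dtype=np.float32).flatten()
```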
Funding: Supported by the National Natural Science Foundation of China and Microsoft Research Asia (No. 60776800), the National Natural Science Foundation of China and the Research Grants Council (No. 60931160443), and the National High-Tech Research and Development (863) Program of China (Nos. 2006AA010101, 2007AA04Z223, 2008AA02Z414, and 2008AA040201).
Abstract: An English speech recognition system was implemented on a chip, called a speech system-on-chip (SoC). The SoC included an application-specific integrated circuit with a vector accelerator to improve performance. A sub-word model based on a continuous-density hidden Markov model recognition algorithm ran on a very low-cost speech chip. The algorithm was a two-stage fixed-width beam-search baseline system, extended with a variable beam-width pruning strategy and a frame-synchronous word-level pruning strategy to significantly reduce recognition time. Tests show that this method reduces recognition time nearly 6-fold and memory usage nearly 2-fold compared to the original system, with less than 1% accuracy degradation on a 600-word recognition task and a recognition accuracy of about 98%.
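The pruning strategies reduce to one core mechanism: at each frame, hypotheses scoring far below the frame's best are dropped. The sketch below shows that mechanism in a toy frame-synchronous Viterbi beam search; the random scores, single-beam structure, and fixed beam value are illustrative, not the paper's two-stage system.

```python
# Sketch: frame-synchronous Viterbi decoding with beam pruning. Hypotheses
# worse than (frame best - beam) are discarded before the next frame.
import numpy as np

def beam_viterbi(log_emis, log_trans, beam=6.0):
    """log_emis: (T, S) per-frame state log scores; log_trans: (S, S)."""
    T, S = log_emis.shape
    active = {0: 0.0}                       # state -> best log score; start in state 0
    for t in range(T):
        nxt = {}
        for s, score in active.items():     # expand only surviving hypotheses
            for s2 in range(S):
                cand = score + log_trans[s, s2] + log_emis[t, s2]
                if cand > nxt.get(s2, -np.inf):
                    nxt[s2] = cand
        best = max(nxt.values())
        active = {s: v for s, v in nxt.items() if v >= best - beam}  # prune the beam
    return max(active.values())

# toy 3-state example with random emission scores and uniform transitions
rng = np.random.default_rng(0)
print(beam_viterbi(rng.normal(size=(20, 3)), np.log(np.full((3, 3), 1 / 3))))
```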