Funding: National Natural Science Foundation of China (No. 62001100).
Abstract: The hidden danger to automatic speaker verification (ASV) systems is the variety of spoofed speech. These threats can be classified into two categories, namely logical access (LA) and physical access (PA). To improve the discrimination capability of spoofed speech detection, this paper focuses on features. Firstly, following the idea of modifying constant-Q-based features, this work considered adding the variance or mean to the constant-Q-based cepstral domain to obtain good performance. Secondly, linear frequency cepstral coefficients (LFCCs) performed comparably with constant-Q-based features. Finally, we proposed linear frequency variance-based cepstral coefficients (LVCCs) and linear frequency mean-based cepstral coefficients (LMCCs) for the identification of speech spoofing. LVCCs and LMCCs are obtained by adding the frame variance or mean to the log magnitude spectrum used for LFCC features. The proposed features were evaluated on the ASVspoof 2019 dataset. The experimental results show that, compared with known hand-crafted features, LVCCs and LMCCs are more effective in resisting spoofed speech attacks.
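A minimal sketch of the LVCC/LMCC idea described above (appending a per-frame statistic of the log magnitude spectrum before the cepstral transform), assuming a plain linear-frequency pipeline of STFT magnitude, log and DCT; the function name, frame handling and parameter values are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.fftpack import dct

def lvcc_lmcc(frames, n_fft=512, n_ceps=20, use_variance=True):
    """Variance/mean-augmented linear-frequency cepstra (LVCC/LMCC sketch).

    frames: 2-D array (num_frames, frame_len) of windowed speech frames.
    The per-frame variance (LVCC) or mean (LMCC) of the log magnitude
    spectrum is appended to that spectrum before the DCT.
    """
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # linear-frequency magnitude spectrum
    log_mag = np.log(mag + 1e-10)                         # log magnitude spectrum
    stat = log_mag.var(axis=1) if use_variance else log_mag.mean(axis=1)
    augmented = np.hstack([log_mag, stat[:, None]])       # append the frame statistic
    return dct(augmented, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

A full LFCC front end would normally insert a linear-frequency filter bank before the log; the augmentation step stays the same either way.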
Abstract: This article presents an exhaustive comparative investigation into the accuracy of gender identification across diverse geographical regions, employing a deep learning classification algorithm for speech signal analysis. In this study, speech samples are categorized for both training and testing purposes based on their geographical origin. Category 1 comprises speech samples from speakers outside of India, whereas Category 2 comprises live-recorded speech samples from Indian speakers. Testing speech samples are likewise classified into four distinct sets, taking into consideration both geographical origin and the language spoken by the speakers. Significantly, the results indicate a noticeable difference in gender identification accuracy among speakers from different geographical areas. Indian speakers, utilizing 52 Hindi and 26 English phonemes in their speech, demonstrate a notably higher gender identification accuracy of 85.75% compared to those speakers who predominantly use 26 English phonemes in their conversations, when the system is trained using speech samples from Indian speakers. The gender identification accuracy of the proposed model reaches 83.20% when the system is trained using speech samples from speakers outside of India. In the analysis of speech signals, Mel Frequency Cepstral Coefficients (MFCCs) serve as relevant features for the speech data. The deep learning classification algorithm utilized in this research is based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture within a Recurrent Neural Network (RNN) model.
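A compact sketch of the MFCC-plus-BiLSTM pipeline the abstract describes, assuming librosa for feature extraction and Keras for the model; the layer sizes, MFCC count and sampling rate are illustrative choices, not values reported by the authors. Utterances would need to be padded or truncated to a common number of frames before training.

```python
import librosa
import tensorflow as tf

def mfcc_sequence(path, sr=16000, n_mfcc=13):
    """Extract an MFCC sequence of shape (time_steps, n_mfcc) for one utterance."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def build_gender_bilstm(time_steps, n_mfcc=13):
    """Bidirectional LSTM over MFCC frames with a binary (male/female) output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_mfcc)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```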
Funding: The China Academy of Railway Sciences Corporation Limited (2023YJ257).
Abstract: Purpose – The safe operation of the metro power transformer directly relates to the safety and efficiency of the entire metro system. Through voiceprint technology, the sounds emitted by the transformer can be monitored in real time, thereby achieving real-time monitoring of the transformer's operational status. However, the environment surrounding power transformers is filled with various interfering sounds that intertwine with both the normal operational voiceprints and faulty voiceprints of the transformer, severely impacting the accuracy and reliability of voiceprint identification. Therefore, effective preprocessing steps are required to identify and separate the sound signals of transformer operation, which is a prerequisite for subsequent analysis. Design/methodology/approach – This paper proposes an Adaptive Threshold Repeating Pattern Extraction Technique (REPET) algorithm to separate and denoise the transformer operation sound signals. By analyzing the Short-Time Fourier Transform (STFT) amplitude spectrum, the algorithm identifies and utilizes the repeating periodic structures within the signal to automatically adjust the threshold, effectively distinguishing and extracting stable background signals from transient foreground events. The REPET algorithm first calculates the autocorrelation matrix of the signal to determine the repeating period, then constructs a repeating segment model. Through comparison with the amplitude spectrum of the original signal, repeating patterns are extracted and a soft time-frequency mask is generated. Findings – After adaptive threshold processing, the target signal is separated. Experiments conducted on mixed sounds to separate background sounds from foreground sounds, with the results compared against those of the FastICA algorithm, demonstrate that the Adaptive Threshold REPET method achieves good separation. Originality/value – A REPET method with an adaptive threshold is proposed, which adopts a dynamic threshold adjustment mechanism, adaptively calculates the threshold for blind source separation, and improves the adaptability and robustness of the algorithm to the statistical characteristics of the signal. It also lays the foundation for transformer fault detection based on acoustic fingerprinting.
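The core REPET steps the abstract walks through (estimating the repeating period from the spectrogram, building a repeating-segment model, and deriving a soft time-frequency mask) could look roughly like the Python sketch below. The adaptive-threshold mechanism of the proposed method is not specified in the abstract and is omitted; all parameter values are assumptions.

```python
import numpy as np
import librosa

def repet_separate(y, sr, n_fft=1024, hop=256):
    """Basic REPET sketch: background = repeating part, foreground = the rest."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    V = np.abs(S)

    # Beat-spectrum-style period estimate: autocorrelate each frequency row
    # over time, sum over frequency, and pick the strongest non-zero lag.
    acf = np.array([np.correlate(row, row, mode='full')[len(row) - 1:] for row in V])
    beat = acf.sum(axis=0)
    period = int(np.argmax(beat[1:])) + 1

    # Repeating-segment model: element-wise median of frames one period apart.
    n_frames = V.shape[1]
    n_seg = n_frames // period
    W = np.median(V[:, :n_seg * period].reshape(V.shape[0], n_seg, period), axis=1)
    W_full = np.tile(W, (1, n_seg + 1))[:, :n_frames]

    # Soft mask: the repeating model cannot exceed the mixture magnitude.
    mask = np.minimum(W_full, V) / (V + 1e-10)
    background = librosa.istft(mask * S, hop_length=hop, length=len(y))
    return background, y - background
```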
Abstract: The rise of the digital economy and the convenience of access through users' mobile devices expedite human endeavors in financial transactions over the Virtual Private Network (VPN) backbone. This prominent application of VPNs evades the hurdles involved in physical money exchange. The VPN acts as a gateway for the authorized user in accessing the banking server, providing mutual authentication between the user and the server. The security of the cloud authentication server remains vulnerable, as shown by the JP Morgan data breach in 2014, the Capital One data breach in 2019, and many other recurring cloud server attacks. These attacks necessitate a strong authentication framework that secures against any class of threat. This research paper proposes a framework based on Elliptic Curve Cryptography (ECC) to perform secure financial transactions through a Virtual Private Network (VPN) by implementing strong Multi-Factor Authentication (MFA) using authentication credentials and biometric identity. The research results show that the proposed model is an ideal scheme for real-time implementation. The security analysis reports that the proposed model exhibits a high level of security with a minimal response time of 12 s on average for 1000 users.
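The abstract does not detail the ECC construction, so the following is a purely illustrative sketch of an ECC-based key agreement between a client and the banking/VPN server using the Python `cryptography` package; the curve choice, HKDF parameters and labels are assumptions, and the paper's actual MFA protocol (credentials plus biometrics) is not shown.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates an EC key pair; NIST P-256 is an assumed curve choice.
client_key = ec.generate_private_key(ec.SECP256R1())
server_key = ec.generate_private_key(ec.SECP256R1())

# ECDH key agreement: each side combines its private key with the peer's
# public key and arrives at the same shared secret.
client_secret = client_key.exchange(ec.ECDH(), server_key.public_key())
server_secret = server_key.exchange(ec.ECDH(), client_key.public_key())

def derive_session_key(shared_secret: bytes) -> bytes:
    """Stretch the raw ECDH secret into a 256-bit session key."""
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"vpn-banking-session").derive(shared_secret)

assert derive_session_key(client_secret) == derive_session_key(server_secret)
```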
Funding: Supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT students launched by the Ministry of Electronics and Information Technology (MeitY), Government of India, under Grant No. PhD-MLA/4(95)/2015-2016.
Abstract: In this paper, we present a comparison of Khasi speech representations with four different spectral features and a novel extension towards the development of Khasi speech corpora. The four features are linear predictive coding (LPC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and Mel frequency cepstral coefficients (MFCC). Ten hours of speech data were used for training and three hours for testing. For each spectral feature, different hidden Markov model (HMM) based recognizers with variations in HMM states and different Gaussian mixture models (GMMs) were built. Performance was evaluated using the word error rate (WER). The experimental results show that MFCC provides a better representation for Khasi speech than the other three spectral features.
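Since the recognizers above are compared by word error rate, a small self-contained sketch of the standard WER computation (edit distance over word sequences) is included here; this is the textbook definition, not code from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```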
Funding: Project supported by the State Key Laboratory of Robotics and System (Grant No. SKLS-2009-MS-10) and the Shanghai Leading Academic Discipline Project (Grant No. J50103).
Abstract: The Hilbert-Huang transform has been widely utilized since its inception because of its superiority in a variety of areas. The Hilbert spectrum thus obtained is able to reflect the distribution of the signal energy across a number of scales accurately. In this paper, a novel feature called ECC is proposed via feature extraction from the Hilbert energy spectrum, which describes the distribution of the instantaneous energy. The experimental results demonstrate that ECC outperforms the traditional short-term average energy. Combining ECC with Mel frequency cepstral coefficients (MFCC) delineates the distribution of energy in the time and frequency domains, and this feature group achieves a better recognition effect than the combination of short-term average energy, pitch and MFCC. Further improvements of ECC are then developed: TECC is obtained by combining ECC with the Teager energy operator, and EFCC is obtained by introducing the instantaneous frequency into the energy. In the experiments, seven emotional states are selected for recognition, and the highest recognition rate of 83.57% is achieved, with the classification accuracy for boredom reaching 100%. The numerical results indicate that the proposed features ECC, TECC and EFCC can improve the performance of speech emotion recognition substantially.
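The abstract does not give the exact ECC formulation, but the pipeline it implies, instantaneous energy from the Hilbert (analytic) signal that is framed, log-compressed and transformed into cepstral-style coefficients, can be sketched as below; the Teager energy operator used for TECC is shown in its standard discrete form. All names and parameter values are illustrative.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.fftpack import dct

def energy_cepstral_sketch(y, frame_len=400, hop=160, n_coef=12):
    """Frame-wise cepstral-style coefficients from the Hilbert instantaneous energy."""
    inst_energy = np.abs(hilbert(y)) ** 2                  # instantaneous energy envelope
    n_frames = 1 + (len(inst_energy) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = inst_energy[i * hop: i * hop + frame_len]
        feats.append(dct(np.log(frame + 1e-10), norm='ortho')[:n_coef])
    return np.array(feats)

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```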
Abstract: Infants produce suggestive, distinctive cries when sick, having belly pain, discomfort or tiredness, or when seeking attention or a change of diapers, among other needs. There is limited knowledge of how to assess infants' needs, as they only relay information through suggestive cries. Many teenagers give birth at an early age and thereby become the key monitors of their own babies, often without sufficient skills to recognize the infant's dire needs, especially during the early stages of infant development. Artificial intelligence has shown promising and efficient predictive analytics across supervised, unsupervised and reinforcement learning models. This study therefore seeks to develop an Android app that can discriminate infant cries by leveraging the strength of convolutional neural networks as the classifier model. Audio analytics is a relatively untapped area in the literature, as it involves messy and voluminous data. This study therefore leverages convolutional neural networks, a deep learning model capable of handling multi-dimensional datasets. To achieve this, the audio waveforms were converted to images through Mel-frequency spectrograms, which were classified using a computer-vision CNN model. The Librosa library was used to convert the audio to Mel spectrograms, which were then presented as pixels serving as the input for classifying the audio classes such as sick, burping, tired, and hungry. The study's goal was to package the model as an Android tool that can be used at home and in hospital facilities for round-the-clock surveillance of an infant's health and social needs.
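A minimal sketch of the waveform-to-Mel-spectrogram step described above, using Librosa as stated in the abstract, followed by a small CNN over the four named classes; the spectrogram size, network depth and layer widths are assumptions, not the authors' configuration.

```python
import numpy as np
import librosa
import tensorflow as tf

def cry_to_melspec_image(path, sr=16000, n_mels=128, target_frames=128):
    """Convert one cry recording into a fixed-size log-Mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Pad or crop the time axis so every example has the same shape.
    if log_mel.shape[1] < target_frames:
        log_mel = np.pad(log_mel, ((0, 0), (0, target_frames - log_mel.shape[1])))
    return log_mel[:, :target_frames][..., np.newaxis]      # (n_mels, frames, 1)

def build_cry_cnn(n_mels=128, frames=128, n_classes=4):
    """Small CNN over the spectrogram image; classes: sick, burping, tired, hungry."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_mels, frames, 1)),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```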
Abstract: The interaction between humans and machines has become an issue of concern in recent years. Besides facial expressions or gestures, speech has been evidenced as one of the most promising modalities for automatic emotion recognition. Affective computing aims to support HCI (Human-Computer Interaction) at a psychological level, allowing computers to adjust their reactions to human requirements. Therefore, the recognition of emotion is pivotal in high-level interactions. Each emotion has distinctive properties that allow us to recognize it. The acoustic signal produced for an identical expression or sentence changes essentially as a direct result of biophysical changes (for example, the stress-induced narrowing of the larynx) set off by emotions. This connection between acoustic cues and emotions has made speech emotion recognition one of the active subjects of the affective computing area. The main goal of a speech emotion recognition algorithm is to determine the emotional state of a speaker from recorded speech signals. This research presents results from the application of k-NN and OVA-SVM to MFCC features, with and without a feature selection approach. The MFCC features were initially extracted from the audio signal to characterize the properties of emotional speech. Secondly, nine basic statistical measures were calculated from the MFCCs, and 117-dimensional features were consequently obtained to train the classifiers for seven different classes of emotion (Anger, Happiness, Disgust, Fear, Sadness, Boredom and Neutral). Next, classification was done in four steps. First, all 117 features were classified using both classifiers. Second, the best classifier was found, and then the features were scaled to [-1, 1] and classified. In the third step, whether feature scaling gives better performance was determined from the results of the second step, and classification was done for each of the basic statistical measures separately. Finally, in the fourth step, the combination of statistical measures that gives better performance was derived using the forward feature selection method. Experiments were carried out using k-NN with different k values and a linear OVA-based SVM classifier with different optimal values. The Berlin emotional speech database for the German language was utilized for testing the planned methodology, and recognition rates as high as 60% were accomplished for the recognition of emotion from the voice signal for the set of statistical measures (median, maximum, mean, inter-quartile range, skewness). OVA-SVM performs better than k-NN, and the use of the feature selection technique gives a higher rate.
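To make the 117-dimensional representation concrete: thirteen MFCCs with nine per-coefficient statistics give 13 x 9 = 117 values per utterance. The sketch below assumes one plausible set of nine measures (the abstract names only five of them explicitly) and shows the [-1, 1] scaling step; everything beyond what the abstract states is an assumption.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis, iqr

def mfcc_statistics(y, sr, n_mfcc=13):
    """Nine per-coefficient statistics over the MFCC trajectory: a 117-dim utterance vector."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    stats = [np.mean, np.median, np.max, np.min, np.std, np.var, iqr, skew, kurtosis]
    return np.hstack([f(m, axis=1) for f in stats])

def scale_to_unit_range(X):
    """Scale each feature column to [-1, 1], as done before SVM/k-NN training."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / (hi - lo + 1e-12) - 1
```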
Abstract: Automatic Speech Recognition (ASR) is the process of converting an acoustic signal captured by a microphone into written text. The motivation of this paper is to create a speech-based Integrated Development Environment (IDE) for C programs. This paper proposes a technique that enables visually impaired people, or people with arm injuries, who have excellent programming skills to write C programs through voice input. The proposed system accepts the C program as voice input and produces a compiled C program as output. The user utters each line of the C program through voice input. First, the voice input is recognized as text. The recognized text is then converted into a C program using the syntactic constructs of the C language. After conversion, the C program is fed as input to the IDE. Furthermore, IDE commands like open, save, close, compile and run are also given through voice input only. If any error occurs during the compilation process, the error is corrected through voice input only; errors can be corrected by specifying the line number through voice input. Performance of the speech recognition system is analyzed by varying the vocabulary size as well as the number of mixture components in the HMM.
Abstract: Due to the presence of non-stationarities and discontinuities in the audio signal, segmentation and classification of audio signals is a challenging task. Automatic music classification and annotation is still considered challenging due to the difficulty of extracting and selecting optimal audio features. Hence, this paper proposes an efficient approach for segmentation, feature extraction and classification of audio signals. Enhanced Mel Frequency Cepstral Coefficient (EMFCC) and Enhanced Power Normalized Cepstral Coefficient (EPNCC) based feature extraction is applied to extract features from the audio signal. Then, multi-level classification is done to classify the audio signal as a musical or non-musical signal. The proposed approach achieves better performance in terms of precision, Normalized Mutual Information (NMI), F-score and entropy. The PNN classifier shows high False Rejection Rate (FRR), False Acceptance Rate (FAR), Genuine Acceptance Rate (GAR), sensitivity, specificity and accuracy with respect to the number of classes.
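For reference, the verification-style metrics reported above follow the standard definitions; a small sketch of how FAR, FRR and GAR are computed from confusion counts at a given decision threshold (the paper's own evaluation protocol is not described in the abstract):

```python
def verification_metrics(tp, fn, fp, tn):
    """FAR = impostor attempts wrongly accepted, FRR = genuine attempts wrongly
    rejected, GAR = genuine attempts correctly accepted."""
    far = fp / (fp + tn) if (fp + tn) else 0.0   # false acceptance rate
    frr = fn / (fn + tp) if (fn + tp) else 0.0   # false rejection rate
    gar = 1.0 - frr                              # genuine acceptance rate
    return far, frr, gar
```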
Abstract: Identification of bird species from their sounds has become an important area in biodiversity-related research due to the relative ease of capturing bird sounds in their commonly challenging habitats. Audio features have a massive impact on the classification task, since they are the fundamental elements used to differentiate classes. As such, the extraction of informative properties of the data is a crucial stage of any classification-based application, and it is vital to identify the most significant features to represent the actual bird sounds. In this paper, we propose a novel feature that can advance classification accuracy, together with modified features that are most suitable for classifying birds from their sounds. Modified Gammatone frequency cepstral coefficient (GTCC) features have been extracted with their filter banks adjusted to suit bird sounds. The features are then used to train and test a support vector machine (SVM) classifier. It has been shown that the modified GTCC features are able to give 86% accuracy on twenty Bornean bird species. Furthermore, we propose a novel probability enhanced entropy (PEE) feature which, when combined with the modified GTCC features, improves accuracy further to 89.5%. These results are significant, as the relatively low-resource-intensive SVM with the proposed modified GTCC and the proposed novel PEE feature can be implemented in a real-time system to assist researchers, scientists, conservationists, and even eco-tourists in identifying bird species in dense forest.
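The exact definition of the probability enhanced entropy (PEE) feature is not given in the abstract; the sketch below shows only the generic ingredient it builds on, treating each frame's normalized spectrum as a probability distribution and taking its entropy, together with the SVM stage in scikit-learn. Names and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def spectral_entropy_per_frame(mag_spec):
    """Entropy of each frame's normalized magnitude spectrum.
    mag_spec: 2-D array (n_bins, n_frames); returns one value per frame."""
    p = mag_spec / (mag_spec.sum(axis=0, keepdims=True) + 1e-12)  # per-frame probabilities
    return -(p * np.log2(p + 1e-12)).sum(axis=0)

# Utterance-level vectors (e.g., GTCC statistics concatenated with entropy-based
# features) would then train an SVM over the bird-species labels:
clf = SVC(kernel='rbf', C=10.0)
# clf.fit(X_train, y_train); accuracy = clf.score(X_test, y_test)
```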