A three-mass model of the vocal cords, together with its mathematical formulation, is discussed. Different kinds of typical hoarse speech due to laryngeal diseases are simulated on a microcomputer, and the effects of different pathological factors of the vocal cords on the model parameters are studied. Some typical spectrum distributions of the simulated speech signals are given. Moreover, hoarse speech signals from some typical clinical cases are analyzed by digital signal processing methods, including FFT, LPC, cepstrum analysis, and pseudocolor encoding. The experimental results show that the three-mass model of the vocal cords is an efficient tool for the analysis of hoarse speech signals.
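As a rough illustration of the lumped-element modeling idea, the sketch below integrates a chain of coupled mass-spring-damper oscillators. All parameter values and the sinusoidal driving term (standing in for glottal pressure) are hypothetical; the paper's full three-mass model also couples the masses to the glottal airflow, which is omitted here.

```python
# Minimal sketch of a lumped-element vocal-cord oscillator (assumed parameters).
import numpy as np
from scipy.integrate import solve_ivp

m = np.array([0.1, 0.1, 0.05])    # masses -- hypothetical values
k = np.array([80.0, 80.0, 40.0])  # spring stiffnesses of each mass
kc = np.array([25.0, 25.0])       # coupling springs between adjacent masses
r = np.array([0.02, 0.02, 0.01])  # damping coefficients

def rhs(t, y):
    x, v = y[:3], y[3:]
    f = -k * x - r * v                           # restoring and damping forces
    f[0] += -kc[0] * (x[0] - x[1])               # coupling mass 1 <-> 2
    f[1] += -kc[0] * (x[1] - x[0]) - kc[1] * (x[1] - x[2])
    f[2] += -kc[1] * (x[2] - x[1])               # coupling mass 2 <-> 3
    f[0] += 0.3 * np.sin(2 * np.pi * 125 * t)    # toy stand-in for glottal drive
    return np.concatenate([v, f / m])

sol = solve_ivp(rhs, (0, 0.1), np.zeros(6), max_step=1e-4)
print(sol.y[0, -5:])  # displacement trace of the first mass
```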
Research on finding the arrival directions of speech signals with a microphone array is presented. We first analyze the uniform microphone array and give a design for a microphone array applied to hands-free speech recognition. By combining the traditional MUltiple SIgnal Classification (MUSIC) direction-finding technique with the focusing-matrix method, we improve the resolving power of the microphone array for multiple speech sources. As one application of Direction of Arrival (DOA) finding, a new microphone-array system for noise reduction is proposed. The new system is based on a maximum-likelihood estimation technique that reconstructs superimposed signals from different directions using the DOA information. The DOA information is obtained with the focusing MUSIC method, which has been shown to outperform the conventional MUSIC method for speaker localization [1].
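For reference, a minimal narrowband MUSIC pseudo-spectrum for a uniform linear array looks like the sketch below; the paper's focusing-matrix extension for wideband speech is not reproduced, and the array spacing and scan grid are assumptions.

```python
# Narrowband MUSIC sketch for a uniform linear array.
import numpy as np

def music_spectrum(X, n_sources, d_over_lambda=0.5):
    """X: (n_mics, n_snapshots) complex snapshots; returns (angles_deg, pseudo-spectrum)."""
    n_mics = X.shape[0]
    R = X @ X.conj().T / X.shape[1]        # sample spatial covariance
    _, vecs = np.linalg.eigh(R)            # eigenvalues ascending
    En = vecs[:, : n_mics - n_sources]     # noise subspace (smallest eigenvalues)
    grid = np.linspace(-90, 90, 361)
    p = []
    for theta in np.deg2rad(grid):
        a = np.exp(-2j * np.pi * d_over_lambda * np.arange(n_mics) * np.sin(theta))
        p.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))  # peaks at source DOAs
    return grid, np.asarray(p)
```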
This paper presents a novel non-contact heart rate extraction method based on vowel speech signals. The method models the relationship between vowel speech production and human heart activity: each heart beat is observed to cause a short increment (evolution) of the vowel formants. The short-time Fourier transform (STFT) is used to detect the formant maximum peaks so as to accurately estimate the heart rate. Compared with a traditional contact pulse oximeter, the average accuracy of the proposed non-contact method exceeds 95%. The proposed method is expected to play an important role in modern medical applications.
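A simplified version of that pipeline might look as follows: track the strongest low-frequency formant peak frame by frame with the STFT, then read the heart rate off the slow modulation of that track. The band edges and the 48-180 beats/min rate limits are assumptions, not values from the paper.

```python
# Sketch of STFT formant-peak tracking followed by modulation-rate readout.
import numpy as np
from scipy.signal import stft

def heart_rate_from_vowel(x, fs):
    f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=768)
    band = (f >= 200) & (f <= 1200)                       # assumed formant search band
    track = f[band][np.argmax(np.abs(Z[band]), axis=0)]   # peak-frequency trajectory (Hz)
    track = track - np.mean(track)
    frame_rate = fs / (1024 - 768)                        # STFT frames per second
    spec = np.abs(np.fft.rfft(track))
    freqs = np.fft.rfftfreq(len(track), d=1.0 / frame_rate)
    hr = (freqs >= 0.8) & (freqs <= 3.0)                  # 48-180 beats per minute
    return 60.0 * freqs[hr][np.argmax(spec[hr])]
```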
Classification of speech signals is a vital part of speech signal processing systems. With the advent of speech coding and synthesis, speech classification has become more accurate and faster. Conventional methods are considered inaccurate for real speech signal classification because of the uncertainty and diversity of speech signals. In this paper, we perform efficient speech signal classification with a series of neural network classifiers combined with reinforcement learning operations. Prior to classification, the study extracts the essential features from the speech signal using cepstral analysis: the speech waveform is converted to a parametric representation, yielding a relatively low data rate. To improve classification precision, Generative Adversarial Networks are used to classify the speech signal after the cepstral features have been extracted. The classifiers are first trained with these features, and the best classifier is chosen to perform classification on new datasets. Validation on the test sets is evaluated with reinforcement learning, which provides feedback to the classifiers. Finally, at the user interface, the signals retrieved from the classifier in response to the input query are decoded and played back. The results are evaluated in terms of accuracy, recall, precision, f-measure, and error rate; the generative adversarial network attains a higher accuracy rate than the other methods: Multi-Layer Perceptron, Recurrent Neural Networks, Deep Belief Networks, and Convolutional Neural Networks.
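The cepstral feature-extraction step can be sketched as follows: frame the waveform, take the log-magnitude spectrum, and invert it to obtain cepstral coefficients. The frame sizes and coefficient count below are assumptions.

```python
# Real-cepstrum feature extraction per frame (assumed frame/hop sizes).
import numpy as np

def cepstral_features(x, frame_len=400, hop=160, n_ceps=13):
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    frames = frames * np.hamming(frame_len)                       # taper each frame
    log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10) # log-magnitude spectrum
    cepstrum = np.fft.irfft(log_mag, axis=1)                      # real cepstrum
    return cepstrum[:, :n_ceps]                                   # low-quefrency coefficients
```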
Support vector machines (SVMs) are utilized for emotion recognition in Chinese speech in this paper. Both binary-class and multi-class discrimination are discussed. It is shown that the emotional features pose a nonlinear problem in the input space, and that SVMs based on nonlinear mapping can solve it more effectively than linear methods. A multi-class SVM classifier with a soft decision function is constructed to classify the four emotional states. Compared with the principal component analysis (PCA) method and a modified PCA method, SVMs with nonlinear kernel mapping achieve the best results in multi-class discrimination.
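A minimal scikit-learn sketch of such a nonlinear-kernel, soft-decision classifier is shown below; the feature matrix `X`, the four-way emotion labels `y`, and the hyperparameter values are assumed inputs, not the paper's configuration.

```python
# RBF-kernel SVM with probabilistic (soft) decisions, as a stand-in sketch.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10.0, gamma="scale", probability=True),
)
# clf.fit(X_train, y_train)
# clf.predict_proba(X_test)  -> soft decision values over the four emotions
```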
Structural and statistical characteristics of signals can improve the performance of Compressed Sensing (CS). Two features of the Discrete Cosine Transform (DCT) coefficients of voiced speech signals are discussed in this paper. The first is the block sparsity of the DCT coefficients of voiced speech, formulated from two different aspects: the distribution of the DCT coefficients, and a comparison of reconstruction performance between the mixed program and Basis Pursuit (BP). This block sparsity means that block-sparse CS algorithms can be used to improve the recovery of speech signals, which is confirmed by simulation results for the mixed program, an improved version of BP. The second is the well-known concentration of the large DCT coefficients of voiced speech at low frequencies. In line with this feature, a special Gaussian and Partial Identity Joint (GPIJ) matrix is constructed as the sensing matrix for voiced speech signals. Simulation results show that the GPIJ matrix outperforms the classical Gaussian matrix for speech signals of male and female adults.
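One plausible reading of the GPIJ construction is sketched below: the first rows directly sample the low-frequency coefficients (where voiced speech concentrates its energy) and the remaining rows are i.i.d. Gaussian. The split ratio and normalization are assumptions; the paper's exact construction may differ in detail.

```python
# Gaussian / partial-identity joint sensing matrix (assumed construction).
import numpy as np

def gpij_matrix(m, n, n_identity, rng=np.random.default_rng(0)):
    ident = np.eye(n)[:n_identity]                         # pass low-frequency rows through
    gauss = rng.standard_normal((m - n_identity, n)) / np.sqrt(m)
    return np.vstack([ident, gauss])                       # (m, n) sensing matrix

Phi = gpij_matrix(m=128, n=512, n_identity=32)
```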
The perceptual effect of phase information in speech has been studied by subjective auditory tests. With the phase spectrum of speech changed while the amplitude spectrum is kept unchanged, the tests show that: (1) if the envelope of the reconstructed speech signal is unchanged, there is no perceptible difference between the original and the reconstructed speech; (2) the auditory perception of the reconstructed speech depends mainly on the amplitude of the derivative of the additive phase; (3) with t_d denoting the maximum relative time shift between different frequency components of the reconstructed speech signal, the speech quality is excellent for t_d < 10 ms, good for 10 ms < t_d < 20 ms, fair for 20 ms < t_d < 35 ms, and poor for t_d > 35 ms.
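One way to construct such test signals is sketched below: keep the magnitude spectrum and add a quadratic phase, whose group delay ramps linearly from 0 to t_d across the band, so frequency components are shifted against each other by at most t_d seconds. This is an illustrative construction under that assumption, not necessarily the stimuli used in the paper.

```python
# Add a phase with group delay tau(f) = td * f / f_Nyquist; magnitude unchanged.
import numpy as np

def disperse_phase(x, fs, td=0.02):
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    fN = fs / 2.0
    phi = -2 * np.pi * td * f**2 / (2 * fN)   # d(phi)/df gives the ramping delay
    return np.fft.irfft(X * np.exp(1j * phi), n=len(x))
```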
Parkinson's disease (PD), one of whose symptoms is dysphonia, is a prevalent neurodegenerative disease. The use of outdated diagnostic techniques, which yield inaccurate and unreliable results, remains an obstacle to early-stage detection and diagnosis for clinical professionals. To address this issue, the study proposes using machine learning and deep learning models to analyze processed speech signals from patients' voice recordings. Datasets of these processed speech signals were obtained and experimented on with random forest and logistic regression classifiers. Results were highly successful, with 90% accuracy produced by the random forest classifier and 81.5% by the logistic regression classifier. Furthermore, a deep neural network was implemented to investigate whether such a variation in method could add to the findings; it proved effective, yielding an accuracy of nearly 92%. These results suggest that it is possible to accurately diagnose early-stage PD merely by testing patients' voices. This research calls for a revolutionary diagnostic approach in decision support systems and is a first step toward a market-wide implementation of healthcare software dedicated to aiding clinicians in the early diagnosis of PD.
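The two classical baselines can be sketched in a few lines of scikit-learn; the synthetic data below is a stand-in for the processed voice-feature matrix (22 features is an assumption), and the hyperparameters are illustrative.

```python
# Random forest and logistic regression baselines on stand-in voice features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=22, random_state=0)

for model in (RandomForestClassifier(n_estimators=300, random_state=0),
              LogisticRegression(max_iter=1000)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```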
In this paper, the frequency-domain Frost algorithm is enhanced with conjugate gradient techniques for speech enhancement. Unlike the non-adaptive approach of computing the optimum minimum variance distortionless response (MVDR) solution by correlation-matrix inversion, the Frost algorithm, which implements the stochastic constrained least mean square (LMS) algorithm, can adaptively converge to the MVDR solution in the mean-square sense, but with a very slow convergence rate. We propose a frequency-domain constrained conjugate gradient (FDCCG) algorithm to speed up the convergence. The devised FDCCG algorithm avoids the matrix inversion and exhibits fast convergence. Speech enhancement experiments with a target speech signal corrupted by two and by five interfering speech signals, conducted with a four-channel acoustic-vector-sensor (AVS) microphone array, demonstrate the superior performance.
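For context, the per-bin target both algorithms converge to is the closed-form MVDR solution w = R⁻¹d / (dᴴR⁻¹d); the paper's contribution is reaching this point by constrained conjugate-gradient iterations rather than the direct solve shown in this sketch.

```python
# Closed-form MVDR weights for one frequency bin (the FDCCG convergence target).
import numpy as np

def mvdr_weights(R, d):
    """R: (M, M) spatial covariance of one bin; d: (M,) steering vector."""
    Rinv_d = np.linalg.solve(R, d)          # R^{-1} d without explicit inversion
    return Rinv_d / (d.conj() @ Rinv_d)     # distortionless normalization
```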
To apply speech recognition systems to real circumstances where handwriting is difficult, such as inspection and maintenance operations in industrial factories or recording and reporting routines at construction sites, countermeasures against surrounding noise are indispensable. In this study, a signal detection method that removes the noise from actual speech signals is proposed, using Bayesian estimation with the aid of bone-conducted speech. More specifically, by introducing Bayes' theorem based on the observation of air-conducted speech contaminated by surrounding background noise, a new type of noise-removal algorithm is theoretically derived. In the proposed method, bone-conducted speech is utilized to obtain a precise estimate of the speech signal. The effectiveness of the proposed method is experimentally confirmed by applying it to air- and bone-conducted speech measured in a real environment with surrounding background noise.
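A heavily simplified sketch of the fusion idea: treat the bone-conducted signal as a prior mean for the clean speech and the air-conducted signal as a noisy observation, then combine them with a conjugate Gaussian Bayes update. The paper derives a more general algorithm; the per-sample Gaussian model and the variance inputs here are assumptions.

```python
# Gaussian Bayes fusion of air-conducted (noisy) and bone-conducted (prior) speech.
import numpy as np

def bayes_fuse(air, bone, noise_var, prior_var):
    # posterior mean of clean speech s given prior s ~ N(bone, prior_var)
    # and observation air = s + n with n ~ N(0, noise_var)
    gain = prior_var / (prior_var + noise_var)
    return bone + gain * (np.asarray(air) - np.asarray(bone))
```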
Research on the features of speech and image signals is carried out from two perspectives, the time domain and the frequency domain. Speech and image signals are non-stationary, so the Fourier Transform (FT) alone does not capture their non-stationary characteristics. When short-term stationary speech is obtained by windowing and framing, the subsequent processing of the signal is completed by the Discrete Fourier Transform (DFT). The Fast Fourier Transform is a commonly used analysis method for frequency-domain speech and image signal processing, but it requires adjusting the window size to obtain the desired resolution. The Fractional Fourier Transform, by contrast, offers both time-domain and frequency-domain processing capabilities. This paper performs global speech encryption by combining speech with an image through the Fractional Fourier Transform: a watermark image, processed by the fractional transformation, is embedded into the speech signal, and the embedded watermark has the effect of rotation and superposition, which improves the security of the speech. The results show that the proposed speech encryption method achieves a higher security level through the Fractional Fourier Transform, and the technique is easy to extend to practical applications.
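The fractional-transform watermarking itself is not reproduced here. As a plainly simplified stand-in, the sketch below embeds flattened watermark bits additively in the DCT domain, the same embed-in-transform-domain pattern with the DCT in place of the Fractional Fourier Transform; the embedding strength and coefficient slots are assumptions.

```python
# Transform-domain watermark embedding, with DCT standing in for the FrFT.
import numpy as np
from scipy.fft import dct, idct

def embed(speech, watermark_bits, alpha=0.01):
    C = dct(np.asarray(speech, dtype=float), norm="ortho")
    k = len(watermark_bits)
    bits = 2 * np.asarray(watermark_bits) - 1         # map {0,1} -> {-1,+1}
    C[1000:1000 + k] += alpha * bits                  # assumed mid-band slots
    return idct(C, norm="ortho")
```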
The speech recognition rate deteriorates greatly in human-machine interaction when the speaker's speech is mixed with a bystander's voice. This paper proposes a time-frequency approach to Blind Source Separation (BSS) for intelligent Human-Machine Interaction (HMI). The main idea of the algorithm is to simultaneously diagonalize the correlation matrices of the pre-whitened signals at different time delays for every frequency bin in the time-frequency domain. The proposed method has two merits: (1) fast convergence; (2) a high signal-to-interference ratio for the separated signals. Numerical evaluations compare the performance of the proposed algorithm with two other deconvolution algorithms. An efficient algorithm to resolve the permutation ambiguity is also proposed. With properly selected parameters, the proposed algorithm saves more than 10% of the computational time and achieves good performance on both simulated convolutive mixtures and real room-recorded speech.
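For two correlation matrices, the diagonalization step reduces to a generalized eigendecomposition, the AMUSE-style special case of the per-bin procedure the paper applies to whole sets of time-lagged matrices. The sketch below illustrates that special case on real-valued, pre-whitened mixtures; the lag value is an assumption.

```python
# Joint diagonalization of a zero-lag and a lag-tau correlation matrix.
import numpy as np
from scipy.linalg import eigh

def diagonalize_pair(X, tau=10):
    """X: (n_channels, n_samples) pre-whitened mixtures; returns separated rows."""
    n = X.shape[1] - tau
    R0 = X[:, :n] @ X[:, :n].T / n               # zero-lag correlation
    Rtau = X[:, tau:tau + n] @ X[:, :n].T / n    # lag-tau correlation
    Rtau = (Rtau + Rtau.T) / 2                   # symmetrize before eigensolve
    _, W = eigh(Rtau, R0)                        # generalized eigenvectors diagonalize both
    return W.T @ X
```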
Compressed sensing, an area of signal processing that has risen in recent years, seeks to minimize the number of samples that must be taken from a signal for precise reconstruction. The precondition of compressed sensing theory is the sparsity of signals. In this paper, two methods to estimate the sparsity level of a signal are formulated, and an approach to estimate the sparsity level directly from the noisy signal is presented. Moreover, a scheme based on distributed compressed sensing for speech signal denoising is described, which exploits multiple measurements of the noisy speech signal to construct block-sparse data and then reconstructs the original speech signal using the block-sparse model-based Compressive Sampling Matching Pursuit (CoSaMP) algorithm. Several simulation results demonstrate the accuracy of the estimated sparsity level and show that this denoising system can achieve favorable performance, especially when speech signals suffer severe noise.
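For orientation, the plain (non-block) CoSaMP iteration that the paper's block-sparse variant builds on is sketched below; the iteration cap and stopping tolerance are assumptions.

```python
# Plain CoSaMP (Needell & Tropp) recovery sketch.
import numpy as np

def cosamp(Phi, y, K, n_iter=30):
    m, n = Phi.shape
    x = np.zeros(n)
    residual = y.copy()
    for _ in range(n_iter):
        proxy = Phi.T @ residual                                   # signal proxy
        support = np.union1d(np.argsort(np.abs(proxy))[-2 * K:],   # 2K best new atoms
                             np.nonzero(x)[0])                     # plus current support
        coef = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]  # least squares on support
        ls = np.zeros(n)
        ls[support] = coef
        keep = np.argsort(np.abs(ls))[-K:]                         # prune to K largest
        x = np.zeros(n)
        x[keep] = ls[keep]
        residual = y - Phi @ x
        if np.linalg.norm(residual) <= 1e-9 * np.linalg.norm(y):
            break
    return x
```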
An important concern in the deaf community is the partial or total inability to hear. This may affect the development of language during childhood, which limits habitual existence. Consequently, to support deaf speakers through assistive mechanisms, an effort has been made to understand the acoustic characteristics of deaf speakers by evaluating territory-specific utterances. Speech signals were acquired from 32 normal and 32 deaf speakers uttering ten native Indian Tamil words. Speech parameters such as pitch, formants, signal-to-noise ratio, energy, intensity, jitter, and shimmer are analyzed. The results show that the acoustic characteristics of deaf speakers differ significantly, and their quantitative measures dominate those of normal speakers for the words considered. The study also reveals that the informative part of speech in normal and deaf speakers may be identified using the acoustic features. In addition, these attributes may be used for differential correction of deaf speakers' speech signals, helping listeners understand the conveyed information.
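Two of the listed measures have compact standard (local) definitions: jitter is the mean absolute difference of consecutive pitch periods over the mean period, and shimmer is the analogue on cycle peak amplitudes. The sketch assumes period and amplitude sequences produced by a separate pitch tracker.

```python
# Local jitter and shimmer from pitch-period and peak-amplitude sequences.
import numpy as np

def jitter(periods):
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer(amplitudes):
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)
```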
Speech signals play an essential role in communication and provide an efficient way to exchange information between humans and machines. Speech Emotion Recognition (SER) is one of the critical sources for human evaluation and is applicable in many real-world applications such as healthcare, call centers, robotics, safety, and virtual reality. This work develops a novel TCN-based emotion recognition system that uses a spatial-temporal convolution network over speech signals to recognize the speaker's emotional state. The authors design a Temporal Convolutional Network (TCN) core block to capture long-term dependencies in speech signals and then feed these temporal cues to a dense network that fuses the spatial features and recognizes global information for the final classification. The proposed network extracts valid sequential cues automatically from speech signals and performs better than state-of-the-art (SOTA) and traditional machine learning algorithms. Results show a high recognition rate compared with SOTA methods: final unweighted accuracies of 80.84% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and 92.31% on the Berlin Emotional Database (EMO-DB) indicate the robustness and efficiency of the designed model.
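A single dilated causal TCN block might look like the PyTorch sketch below; the channel count, kernel width, and dilation are assumptions, and the paper's full model stacks several such blocks before the dense fusion network.

```python
# One dilated causal TCN block with a residual connection (PyTorch).
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels=64, kernel=3, dilation=2):
        super().__init__()
        self.pad = (kernel - 1) * dilation   # left-pad so convolutions stay causal
        self.conv1 = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                    # x: (batch, channels, time)
        y = self.relu(self.conv1(nn.functional.pad(x, (self.pad, 0))))
        y = self.relu(self.conv2(nn.functional.pad(y, (self.pad, 0))))
        return x + y                         # residual connection
```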
The Laboratory of Acoustics, Speech and Signal Processing (LASSP), the unique and superior national key laboratory of ASSP in China, has been founded at the Institute of Acoustics, Academia Sinica, Beijing, PRC. After three years of effort, the construction of the LASSP has been completed successfully, and a certain capability of performing frontier research projects in the fundamental theory and applied technology of sound fields and acoustic signal processing has been formed. A flexible and complete experimental acoustic signal processing system has been set up in the LASSP. With the remarkable advantages of real-time signal processing and resource sharing, a wide range of research projects in the field of ASSP can be conducted in the laboratory. The Signal Processing Center of the LASSP is well equipped with many computer research facilities, including the
Steganalysis can be used to classify whether or not an object contains hidden information. In this article, a novel approach is presented to detect the presence of least significant bit (LSB) steganographic messages in a voice secure communication system. A distance measure, proven sensitive to LSB steganography by analysis of variance (ANOVA), is used to estimate the difference between the host signal and the stego signal. A maximum likelihood (ML) decision is then combined with it to form the classifier. Statistical experiments show that the proposed approach has a high accuracy rate and low computational complexity.
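As a flavor of the kind of quantity such detectors work with, the sketch below flips every LSB of an integer PCM signal and measures the resulting spectral change; a clean host and an LSB-embedded signal respond differently to this perturbation. The paper's specific ANOVA-selected distance measure and ML decision rule are not reproduced.

```python
# LSB-flip perturbation distance on an integer PCM signal.
import numpy as np

def lsb_flip(samples):
    """Flip every least significant bit of an integer PCM signal."""
    return np.asarray(samples) ^ 1

def flip_distance(samples):
    a = np.abs(np.fft.rfft(np.asarray(samples, dtype=float)))
    b = np.abs(np.fft.rfft(lsb_flip(samples).astype(float)))
    return np.mean(np.abs(a - b))   # mean absolute spectral change
```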
Parkinson's disease is one of the most destructive diseases of the nervous system, and speech disorder is one of its typical symptoms. Approximately 90% of Parkinson's patients develop some degree of speech disorder, which affects speech function faster than any other subsystem of the body. Screening for Parkinson's disease by voice is a very effective method that has attracted a growing number of researchers over the past decade. Patients with Parkinson's disease can be identified by recording the sound of their pronunciation of words, extracting appropriate features, and identifying the disturbances in their voices. This paper proposes an improved genetic algorithm combined with a data enhancement method for Parkinson's speech signal recognition. Specifically, the method first extracts representative speech signal features through an L1-regularized SVM and then enhances the representative feature data with the SMOTE algorithm. Following this, both the original and enhanced features are used to train an SVM classifier for speech signal recognition, with an improved genetic algorithm applied to find the optimal parameters of the SVM. The effectiveness of the proposed model is demonstrated on the Parkinson's disease audio data set from the UCI machine learning repository; compared with the most advanced methods, the proposed method performs best.
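The SMOTE-plus-SVM stage can be sketched with the imbalanced-learn package (an assumed dependency); a plain grid search over (C, gamma) stands in here for the paper's improved genetic algorithm, and the synthetic data is a stand-in for the UCI features.

```python
# SMOTE oversampling followed by SVM hyperparameter search (GA stand-in).
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=195, n_features=22,
                           weights=[0.75], random_state=0)   # imbalanced stand-in data
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)      # oversample the minority class

search = GridSearchCV(SVC(), {"C": [1, 10, 100],
                              "gamma": ["scale", 0.01, 0.1]}, cv=5)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```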
The 4th National Conference on Speech, Image, Communication and Signal Processing, sponsored by the Institute of Speech, Hearing, and Music Acoustics of the Acoustical Society of China and the Institute of Signal Processing of the Electronic Society of China, was held on 25-27 October 1989 at the Beijing Institute of Post and Telecommunication. The conference drew a registration of 150 from different places in the country, making it the largest such conference in the last eight years. The president of the Institute of Speech, Hearing, and Music Acoustics, ASC, Professor ZHANG Jialu, made an opening speech at the opening session, and the honorary president of the Acoustical Society of China, Professor MAA Dah-You, and the president of