Funding: This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61571106, 61501169, and 41706103, and the Fundamental Research Funds for the Central Universities under Grant No. 2242013K30010.
Abstract: Speaker separation in complex acoustic environments is one of the most challenging tasks in speech separation. In practice, speakers are often stationary or moving slowly during normal communication. In this case, the spatial features among consecutive speech frames become highly correlated, which helps speaker separation by providing additional spatial information. To fully exploit this information, we design a separation system based on a recurrent neural network (RNN) with long short-term memory (LSTM) that effectively learns the temporal dynamics of spatial features. In detail, an LSTM-based speaker separation algorithm is proposed to extract the spatial features in each time-frequency (TF) unit and form the corresponding feature vector. We then treat speaker separation as a supervised learning problem, where a modified ideal ratio mask (IRM) is defined as the training target during LSTM learning. Simulations show that the proposed system achieves attractive separation performance in noisy and reverberant environments. Specifically, in untrained acoustic tests with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and reverberation, the proposed LSTM-based algorithm still outperforms the existing DNN-based method in terms of PESQ and STOI. This indicates that our method is more robust under untrained conditions.
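To make the mask-estimation idea concrete, here is a minimal sketch (not the authors' implementation): a standard IRM is used in place of the paper's unspecified modified IRM, the spatial features are random placeholders, and the PyTorch LSTM sizes are arbitrary assumptions.

```python
# Sketch only: standard IRM as the training target and an LSTM mask estimator.
import torch
import torch.nn as nn

def ideal_ratio_mask(target_mag, interf_mag, eps=1e-8):
    """Standard IRM per TF unit; the paper uses a modified variant (assumption)."""
    return target_mag**2 / (target_mag**2 + interf_mag**2 + eps)

class MaskLSTM(nn.Module):
    """Maps per-frame spatial feature vectors to a TF mask (hypothetical sizes)."""
    def __init__(self, feat_dim=64, hidden=256, n_freq=257):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # mask values in [0, 1]

# Toy usage: train toward the IRM with an MSE loss.
feats = torch.randn(4, 100, 64)
tgt_mag, itf_mag = torch.rand(4, 100, 257), torch.rand(4, 100, 257)
model = MaskLSTM()
loss = nn.functional.mse_loss(model(feats), ideal_ratio_mask(tgt_mag, itf_mag))
loss.backward()
```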
Funding: National Natural Science Foundation of China, Grant Nos. 61271082, 61201029, and 61102094.
Abstract: In this paper, we applied RobustICA to speech separation and made a comprehensive comparison with FastICA based on the separation results. Through a series of speech signal separation tests, RobustICA reduced the separation time consumed by FastICA while showing higher stability, and the speeches separated by RobustICA were shown to have lower separation errors. In the 14 groups of speech separation tests, the separation time consumed by RobustICA was 3.185 s less than that of FastICA, a reduction of nearly 68%. The separation errors of FastICA fluctuated between 0.004 and 0.02, while the errors of RobustICA remained around 0.003. Furthermore, compared to FastICA, RobustICA showed better separation robustness. Experimental results showed that RobustICA can be successfully applied to speech signal separation and is superior to FastICA in speech separation.
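As a point of reference for the comparison described above, the following sketch runs scikit-learn's FastICA on a synthetic two-source instantaneous mixture; RobustICA is not available in scikit-learn, so only the FastICA side is illustrated, and the sources and mixing matrix are toy assumptions.

```python
# Baseline sketch: FastICA on a synthetic two-source mixture.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)               # stand-in "speech" source 1
s2 = np.sign(np.sin(2 * np.pi * 330 * t))      # stand-in source 2
S = np.c_[s1, s2]
A = np.array([[1.0, 0.6], [0.4, 1.0]])         # assumed mixing matrix
X = S @ A.T                                    # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                   # estimated sources
print(S_hat.shape)                             # (8000, 2)
```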
Funding: This work is supported by the National Key Research and Development Program of China under Grants 2020YFC2004003 and 2020YFC2004002, and the National Natural Science Foundation of China (NSFC) under Grant No. 61571106.
Abstract: Traditional separation methods have limited ability to handle the speech separation problem in highly reverberant and low signal-to-noise ratio (SNR) environments, and thus achieve unsatisfactory results. In this study, a convolutional neural network with temporal convolution and a residual network (TC-ResNet) is proposed to realize speech separation in a complex acoustic environment. A simplified steered-response power phase transform, denoted GSRP-PHAT, is employed to reduce the computational cost. The extracted features are reshaped into a special tensor used as the system input, and temporal convolution is applied, which not only enlarges the receptive field of the convolution layer but also significantly reduces the network's computational cost. Residual blocks are used to combine multiresolution features and accelerate the training procedure. A modified ideal ratio mask is applied as the training target. Simulation results demonstrate that the proposed microphone array speech separation algorithm based on TC-ResNet achieves better performance in terms of distortion ratio, source-to-interference ratio, and short-time objective intelligibility in low-SNR and highly reverberant environments, particularly in untrained situations. This indicates that the proposed method generalizes to untrained conditions.
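The following is a minimal sketch of a dilated temporal-convolution residual block of the kind the abstract describes; the channel count, kernel width, dilation, and normalization choices are illustrative assumptions rather than the paper's TC-ResNet configuration.

```python
# Sketch of a dilated temporal-convolution residual block in PyTorch.
import torch
import torch.nn as nn

class TemporalResBlock(nn.Module):
    def __init__(self, channels=64, kernel=3, dilation=2):
        super().__init__()
        pad = (kernel - 1) * dilation // 2          # keep the frame count unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):                           # x: (batch, channels, frames)
        return torch.relu(self.net(x) + x)          # residual connection

x = torch.randn(2, 64, 200)                         # e.g. reshaped GSRP-PHAT features
print(TemporalResBlock()(x).shape)                  # torch.Size([2, 64, 200])
```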
Abstract: In the design of hearing aids (HA), real-time speech enhancement is performed. Digital hearing aids should provide a high signal-to-noise ratio and gain improvement, and should eliminate feedback. In generic hearing aids, the performance towards different frequencies varies and is non-uniform. Existing noise cancellation and speech separation methods reduce the voice magnitude in noisy environments. The frequency response of the HA is non-uniform. Existing noise suppression methods also reduce the strength of the desired signal, so the performance of uniform sub-band analysis is poor for hearing aid applications. In this paper, a speech separation method using the non-negative matrix factorization (NMF) algorithm is proposed for wavelet decomposition. The proposed non-uniform filter bank was validated by parameters such as band power, signal-to-noise ratio (SNR), mean square error (MSE), signal-to-noise-and-distortion ratio (SINAD), spurious-free dynamic range (SFDR), error, and time. The speech recordings before and after separation were evaluated for quality using objective speech quality measures, namely the International Telecommunication Union-Telecommunication standard ITU-T P.862.
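A rough sketch of NMF-based magnitude-spectrogram separation is shown below; the wavelet/non-uniform filter-bank front end from the paper is omitted, the input is random toy data, and the assignment of NMF components to sources is an arbitrary assumption for illustration.

```python
# Sketch of NMF-based separation on a magnitude spectrogram (toy data).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((257, 200))                     # |STFT|: freq bins x frames (toy data)

model = NMF(n_components=8, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(V)                     # spectral basis vectors (257 x 8)
H = model.components_                          # activations (8 x 200)

V_src1 = W[:, :4] @ H[:4, :]                   # components assumed to belong to source 1
V_src2 = W[:, 4:] @ H[4:, :]                   # remaining components -> source 2
mask1 = V_src1 / (V_src1 + V_src2 + 1e-8)      # soft mask applied to the mixture
print(mask1.shape)                             # (257, 200)
```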
Abstract: The Deep Attractor Network (DANet) is a state-of-the-art technique in the speech separation field, which uses Bidirectional Long Short-Term Memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified and powerful DANet model is proposed using a Bidirectional Gated Recurrent Unit (BGRU) network instead of BLSTM. A Gaussian Mixture Model (GMM), rather than k-means, was applied in DANet as the clustering algorithm to reduce the complexity and increase the learning speed and accuracy. The metrics used in this paper are the Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), Signal-to-Artifact Ratio (SAR), and the Perceptual Evaluation of Speech Quality (PESQ) score. Two-speaker mixture datasets from the TIMIT corpus were prepared to evaluate the proposed model, and the system achieved 12.3 dB and 2.94 for the SDR and PESQ scores respectively, which were better than the original DANet model. Further improvements of 20.7% and 17.9% were obtained in the number of parameters and the training time, respectively. The model was also applied to mixed Arabic speech signals, and the results were better than those for English.
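Only the clustering step is sketched below (not the authors' full DANet pipeline): random placeholder TF-unit embeddings are clustered with scikit-learn's GaussianMixture, the component means act as attractors, and the GMM posteriors serve as soft per-unit assignments, which is one simple choice rather than the paper's exact mask computation.

```python
# Sketch of GMM-based attractor formation from TF-unit embeddings (toy data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 20))              # (TF units, embedding dim) from the network

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(emb)
attractors = gmm.means_                        # one attractor per speaker (2 x 20)
soft_masks = gmm.predict_proba(emb)            # soft assignment of each TF unit to a speaker
print(attractors.shape, soft_masks.shape)      # (2, 20) (5000, 2)
```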
Funding: This work is supported by the Nanjing Institute of Technology (NIT) fund for Research Startup Projects of Introduced Talents under Grant No. YKJ202019, the Natural Science Research Project of Higher Education Institutions in Jiangsu Province under Grant No. 21KJB510018, the National Natural Science Foundation of China (NSFC) under Grant No. 62001215, and the NIT fund for Doctoral Research Projects under Grant No. ZKJ2020003.
Abstract: Speech separation is an active research topic that plays an important role in numerous applications, such as speaker recognition, hearing prostheses, and autonomous robots. Many algorithms have been put forward to improve separation performance. However, speech separation in reverberant, noisy environments is still a challenging task. To address this, a novel speech separation algorithm using a gated recurrent unit (GRU) network based on a microphone array is proposed in this paper. The main aim of the proposed algorithm is to improve the separation performance and reduce the computational cost. The proposed algorithm extracts the sub-band steered response power-phase transform (SRP-PHAT) weighted by a gammatone filter as the speech separation feature due to its discriminative and robust spatial position information. Since the GRU network has the advantage of processing time series data with faster training speed and fewer training parameters, the GRU model is adopted to process the separation features of several sequential frames in the same sub-band to estimate the ideal ratio mask (IRM). The proposed algorithm decomposes the mixture signals into time-frequency (TF) units using a gammatone filter bank in the frequency domain, and the target speech is reconstructed in the frequency domain by masking the mixture signal according to the estimated IRM. The operations of decomposing the mixture signal and reconstructing the target signal are completed in the frequency domain, which reduces the total computational cost. Experimental results demonstrate that the proposed algorithm realizes omnidirectional speech separation in noisy and reverberant environments, provides good performance in terms of speech quality and intelligibility, and generalizes to reverberant conditions.
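The sketch below illustrates the GCC-PHAT cross-correlation that underlies SRP-PHAT-style spatial features; the gammatone sub-band filtering, the steering over candidate directions, and the GRU mask estimator from the paper are omitted, and the two-microphone signals are synthetic.

```python
# Sketch of GCC-PHAT between two microphone frames (underlies SRP-PHAT features).
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """Phase-transform-weighted cross-correlation between two microphone frames."""
    X1, X2 = np.fft.rfft(x1, n_fft), np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep phase only
    return np.fft.irfft(cross, n_fft)          # peak location ~ time difference of arrival

fs = 16000
t = np.arange(512) / fs
x1 = np.sin(2 * np.pi * 440 * t)
x2 = np.concatenate((np.zeros(3), x1[:-3]))    # x1 delayed by 3 samples at the second mic
cc = gcc_phat(x1, x2)
lag = np.argmax(np.concatenate((cc[-32:], cc[:32]))) - 32
print(lag)                                     # -3 (sign depends on channel ordering)
```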
Funding: The Key Project of the National Natural Science Foundation of China (No. U1836220), the National Natural Science Foundation of China (No. 61672267), the Qing Lan Talent Program of Jiangsu Province, China, and the Key Innovation Project of Undergraduate Students in Jiangsu Province, China (No. 201810299045Z).
Abstract: Much recent progress in monaural speech separation (MSS) has been achieved through a series of deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to construct a specific audio source of interest. However, these approaches can neither learn generative factors of the original input for MSS nor construct each audio source in mixed speech. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which introduces a regularization loss in the objective function to isolate one source without containing other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE can learn source-specific generative factors and a set of discriminative features for each source, leading to MSS performance improvement. Experiments on benchmark datasets show that our approach outperforms the existing methods. In terms of three important metrics, WFAE has great success on a relatively challenging MSS case, i.e., speaker-independent MSS.
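Purely as a schematic of the idea of adding a regularization term that isolates one source, the sketch below combines a reconstruction loss with a cosine-similarity penalty between each estimate and the competing reference; the actual WFAE architecture, latent attention mechanism, and loss are not reproduced, and all tensors are toy data.

```python
# Schematic only: encoder, two source constructors, and an assumed isolation penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(257, 128), nn.ReLU())
dec1 = nn.Sequential(nn.Linear(128, 257), nn.ReLU())   # constructor for source 1
dec2 = nn.Sequential(nn.Linear(128, 257), nn.ReLU())   # constructor for source 2

mix = torch.rand(8, 257)                               # mixture magnitude frames (toy)
s1, s2 = torch.rand(8, 257), torch.rand(8, 257)        # reference sources (toy)

z = enc(mix)
est1, est2 = dec1(z), dec2(z)
recon = F.mse_loss(est1, s1) + F.mse_loss(est2, s2)
# Assumed form of an "isolation" regularizer: penalize similarity between each
# estimate and the competing source's reference.
lam = 0.1
reg = lam * (F.cosine_similarity(est1, s2, dim=1).mean()
             + F.cosine_similarity(est2, s1, dim=1).mean())
loss = recon + reg
loss.backward()
print(loss.item())
```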
Funding: Supported by the National Natural Science Foundation of China (No. 60172048).
Abstract: This letter proposes a new method for concurrent voiced speech separation. First, the Wrapped Discrete Fourier Transform (WDFT) is used to decompose the harmonic spectra of the mixed speech. Then the individual speech signals are reconstructed using the sinusoidal speech model. By taking advantage of the non-uniform frequency resolution of the WDFT, the harmonic spectral parameters can be estimated and separated accurately. Experimental results on mixed vowel separation show that the proposed method can recover the original speech signals effectively.
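The reconstruction step of the sinusoidal speech model can be sketched as below; the harmonic amplitudes and phases are made-up values, and the WDFT-based analysis that would estimate them is not shown.

```python
# Sketch of sinusoidal-model reconstruction of one voiced frame from harmonic parameters.
import numpy as np

fs = 8000
f0 = 120.0                                     # fundamental frequency of one talker
n = np.arange(int(0.032 * fs))                 # one 32 ms frame
amps = np.array([1.0, 0.6, 0.4, 0.25, 0.15])   # toy harmonic amplitudes
phases = np.zeros_like(amps)                   # toy phases

frame = sum(a * np.cos(2 * np.pi * f0 * (k + 1) * n / fs + p)
            for k, (a, p) in enumerate(zip(amps, phases)))
print(frame.shape)                             # (256,)
```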
Abstract: In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end and emphasize feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
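As a small illustration of one surveyed component, the snippet below evaluates PyTorch's built-in CTC criterion (nn.CTCLoss) on random tensors; the vocabulary size, sequence lengths, and batch size are arbitrary.

```python
# Minimal usage of the CTC criterion on random data.
import torch
import torch.nn as nn

T, B, C = 50, 4, 28                     # frames, batch, classes (27 labels + blank 0)
log_probs = torch.randn(T, B, C).log_softmax(dim=2)
targets = torch.randint(1, C, (B, 12))  # label indices (0 is reserved for blank)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```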
Funding: Supported by the National Natural Science Foundation of China (No. 61805234), the Key Research Program of Frontier Science, CAS (No. QYZDB-SSWSLH014), and the Foundation of the State Key Laboratory of Laser Interaction with Matter (No. SKLLIM1704).
Abstract: Based on the 1550 nm all-fiber pulsed laser Doppler vibrometer (LDV) system independently developed by our laboratory, the empirical mode decomposition (EMD) and optimally modified log-spectral amplitude estimator (OM-LSA) algorithms are combined to separate the speech micro-vibration from the target's macro motion. This combined algorithm compensates for the weakness of the EMD algorithm in denoising and the inability of the OM-LSA algorithm to perform signal separation, achieving separation and simultaneous acquisition of the macro motion and speech micro-vibration of a target. The experimental results indicate that, using this combined algorithm, the LDV system can operate effectively within 30 m and gains a 4.21 dB improvement in signal-to-noise ratio (SNR) relative to a traditional OM-LSA algorithm.
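A sketch of the EMD stage alone is given below, assuming the third-party PyEMD package is installed; the vibration signal is synthetic, and the OM-LSA enhancement stage is not reproduced.

```python
# Sketch of EMD on a synthetic LDV-like signal (macro motion + speech micro-vibration).
import numpy as np
from PyEMD import EMD

fs = 10000
t = np.arange(0, 1.0, 1.0 / fs)
macro = 0.5 * np.sin(2 * np.pi * 2 * t)            # slow target macro motion (toy)
speech_vib = 0.05 * np.sin(2 * np.pi * 200 * t)    # speech micro-vibration (toy)
signal = macro + speech_vib + 0.01 * np.random.randn(t.size)

imfs = EMD().emd(signal)                           # intrinsic mode functions
print(imfs.shape)                                  # (n_imfs, len(signal))
# High-frequency IMFs would then be passed to OM-LSA-style enhancement,
# while low-frequency IMFs capture the macro motion.
```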
Funding: National Natural Science Foundation of China, Grant/Award Numbers: 92264106, U22A2076, 62090034, DT23F0401, DT23F04008, DT23F04009; Young Scientists Fund of the National Natural Science Foundation of China, Grant/Award Number: 62204219.
Abstract: Based on brain-inspired computing frameworks, neuromorphic systems implement large-scale neural networks in hardware. Although rapid advances have been made in the development of artificial neurons and synapses in recent years, further research is moving beyond these individual components and focusing on neuronal circuit motifs with specialized excitatory-inhibitory (E-I) connectivity patterns. In this study, we demonstrate a core processor that can be used to construct commonly used neuronal circuits. The neuron, featuring an ultracompact physical configuration, integrates a volatile threshold switch with a gate-modulated two-dimensional (2D) MoS₂ field-effect channel to process complex E-I spatiotemporal spiking signals. Consequently, basic neuronal circuits are constructed for biorealistic neuromorphic computing. For practical applications, an algorithm-hardware co-design is implemented in a gate-controlled spiking neural network with substantial performance improvement in human speech separation.
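As a software analogue only (not the MoS₂ device), the sketch below runs a leaky integrate-and-fire neuron driven by excitatory and inhibitory spike trains to illustrate the E-I integration described above; all constants and spike rates are arbitrary assumptions.

```python
# Software analogue of E-I integration in a leaky integrate-and-fire neuron.
import numpy as np

rng = np.random.default_rng(0)
steps, dt, tau, v_th = 1000, 1e-3, 20e-3, 1.0
exc = rng.random(steps) < 0.05                 # excitatory input spike train
inh = rng.random(steps) < 0.02                 # inhibitory input spike train

v, out_spikes = 0.0, []
for step in range(steps):
    dv = (-v + 8.0 * exc[step] - 6.0 * inh[step]) * (dt / tau)
    v += dv
    if v >= v_th:                              # threshold crossing -> output spike
        out_spikes.append(step)
        v = 0.0                                # reset membrane potential
print(len(out_spikes))
```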