Funding: National Natural Science Foundation of China, Grant/Award Number 62071039; Beijing Natural Science Foundation, Grant/Award Number L223033.
Abstract: End-to-end separation algorithms, which achieve superior performance in speech separation, have not been used effectively in music separation. Moreover, since music signals are often dual-channel data with a high sampling rate, how to model long-sequence data and make rational use of the information shared between channels is an urgent problem. To address these problems, the performance of an end-to-end music separation algorithm is enhanced by improving the network structure. Our main contributions are as follows: (1) a more reasonable densely connected U-Net is designed to capture long-term characteristics of music, such as the main melody and tone; (2) on this basis, multi-head attention and a dual-path transformer are introduced into the separation module. Channel attention units are applied recursively to the feature map of each network layer, enabling the network to separate long sequences. Experimental results show that, after introducing channel attention, the proposed algorithm achieves a stable improvement over the baseline system. On the MUSDB18 dataset, the average score of the separated audio exceeds that of the current best-performing music separation algorithm based on the time-frequency (T-F) domain.
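To illustrate the channel-attention idea, here is a minimal PyTorch sketch of a squeeze-and-excitation-style unit gating a (batch, channels, time) feature map. The module and parameter names are hypothetical, and the paper's exact unit design may differ:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention over a (B, C, T) feature map."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pool over time, then per-channel gating.
        w = self.gate(x.mean(dim=-1))        # (B, C) gate weights in (0, 1)
        return x * w.unsqueeze(-1)           # rescale each channel of the map

if __name__ == "__main__":
    feats = torch.randn(2, 64, 4000)         # illustrative encoder feature map
    print(ChannelAttention(64)(feats).shape) # torch.Size([2, 64, 4000])
```

Applied recursively at each layer, such a unit lets the network reweight channels cheaply, which is one plausible reading of the stable improvement the abstract reports.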
Funding: This work is supported by the Nanjing Institute of Technology (NIT) fund for Research Startup Projects of Introduced Talents under Grant No. YKJ202019; the Natural Science Research Project of Higher Education Institutions in Jiangsu Province under Grant No. 21KJB510018; the National Natural Science Foundation of China (NSFC) under Grant No. 62001215; and the NIT fund for Doctoral Research Projects under Grant No. ZKJ2020003.
Abstract: Speech separation is an active research topic that plays an important role in numerous applications, such as speaker recognition, hearing prostheses, and autonomous robots. Many algorithms have been put forward to improve separation performance; however, speech separation in reverberant, noisy environments remains challenging. To address this, a novel speech separation algorithm using a gated recurrent unit (GRU) network based on a microphone array is proposed in this paper. Its main aims are to improve separation performance and reduce computational cost. The algorithm extracts the sub-band steered response power-phase transform (SRP-PHAT) weighted by a gammatone filter as the separation feature because of its discriminative and robust spatial position information. Since a GRU network processes time-series data with faster training and fewer parameters, a GRU model is adopted to process the separation features of several sequential frames in the same sub-band and estimate the ideal ratio mask (IRM). The algorithm decomposes the mixture signals into time-frequency (T-F) units using a gammatone filter bank in the frequency domain, and the target speech is reconstructed in the frequency domain by masking the mixture signal according to the estimated IRM. Performing both decomposition and reconstruction in the frequency domain reduces the total computational cost. Experimental results demonstrate that the proposed algorithm realizes omnidirectional speech separation in noisy and reverberant environments, provides good speech quality and intelligibility, and generalizes to reverberant conditions.
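A minimal sketch of the mask-estimation stage described above: a GRU consumes the feature vectors of several sequential frames from one sub-band and outputs a per-frame IRM value in [0, 1]. Class, dimension, and variable names here are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SubbandGRUMasker(nn.Module):
    """GRU over sequential frames of one sub-band, predicting an IRM per frame."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(feats)            # (B, frames, hidden)
        return self.head(h).squeeze(-1)   # (B, frames), mask values in (0, 1)

if __name__ == "__main__":
    # (batch, frames, feature dim), e.g. sub-band SRP-PHAT features; shapes illustrative
    feats = torch.randn(8, 50, 40)
    mask = SubbandGRUMasker(40)(feats)
    print(mask.shape, float(mask.min()), float(mask.max()))
```

The sigmoid head keeps every estimate a valid ratio mask, so masking the mixture sub-band with it directly yields the reconstructed target.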
Funding: This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61571106, 61501169, and 41706103, and the Fundamental Research Funds for the Central Universities under Grant No. 2242013K30010.
Abstract: Speaker separation in complex acoustic environments is one of the challenging tasks in speech separation. In practice, speakers are very often stationary or moving slowly during normal communication. In this case, the spatial features of consecutive speech frames become highly correlated, which helps speaker separation by providing additional spatial information. To fully exploit this information, we design a separation system on a recurrent neural network (RNN) with long short-term memory (LSTM) that effectively learns the temporal dynamics of spatial features. In detail, an LSTM-based speaker separation algorithm is proposed to extract the spatial features in each time-frequency (T-F) unit and form the corresponding feature vector. We then treat speaker separation as a supervised learning problem, where a modified ideal ratio mask (IRM) is defined as the training target during LSTM learning. Simulations show that the proposed system achieves attractive separation performance in noisy and reverberant environments. Specifically, in untrained acoustic tests with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and reverberation, the proposed LSTM-based algorithm still outperforms the existing DNN-based method in terms of PESQ and STOI, indicating that our method is more robust in untrained conditions.
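The paper uses a modified IRM whose exact form is not given in this abstract; for reference, the standard IRM over T-F units is IRM(t, f) = (|S(t, f)|^2 / (|S(t, f)|^2 + |N(t, f)|^2))^beta, commonly with beta = 0.5. A short sketch of computing it as a training target (eps and beta are conventional choices, not the paper's values):

```python
import torch

def ideal_ratio_mask(speech_mag: torch.Tensor,
                     noise_mag: torch.Tensor,
                     beta: float = 0.5,
                     eps: float = 1e-8) -> torch.Tensor:
    """Standard IRM per T-F unit from clean-speech and noise magnitudes."""
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + eps)) ** beta

if __name__ == "__main__":
    s = torch.rand(4, 257, 100)   # (batch, freq bins, frames), illustrative STFT magnitudes
    n = torch.rand(4, 257, 100)
    irm = ideal_ratio_mask(s, n)
    print(irm.shape, float(irm.min()), float(irm.max()))  # all values in [0, 1]
```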
Abstract: The Deep Attractor Network (DANet) is the state-of-the-art technique in speech separation; it uses bidirectional long short-term memory (BLSTM), but the complexity of the DANet model is very high. In this paper, a simplified yet powerful DANet model is proposed that uses a bidirectional gated recurrent unit (BGRU) network instead of BLSTM. A Gaussian mixture model (GMM), rather than k-means, is applied in DANet as the clustering algorithm to reduce complexity and increase learning speed and accuracy. The metrics used in this paper are the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR), and the Perceptual Evaluation of Speech Quality (PESQ) score. Two-speaker mixture datasets from the TIMIT corpus were prepared to evaluate the proposed model, and the system achieved 12.3 dB SDR and a 2.94 PESQ score, both better than the original DANet model. Further improvements of 20.7% and 17.9% were obtained in the number of parameters and training time, respectively. The model was also applied to mixed Arabic speech signals, and the results were better than those for English.
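To make the k-means-to-GMM swap concrete, here is a minimal sketch of clustering T-F embedding vectors with a GMM, whose component means play the role of attractors and whose soft posteriors replace hard k-means assignments. The data, shapes, and hyperparameters are purely illustrative, not the paper's setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake embeddings: (T*F, D) vectors from two well-separated "speakers".
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 1.0, (500, 20)),
                 rng.normal(3.0, 1.0, (500, 20))])

# Fit a two-component GMM in embedding space (stand-in for k-means).
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(emb)

assign = gmm.predict_proba(emb)   # (T*F, 2) soft assignments, rows sum to 1
attractors = gmm.means_           # (2, 20) component means act as attractors
print(assign.shape, attractors.shape)
```

The soft posteriors can then be reshaped back to the T-F grid and used directly as ratio-mask-like weights, which is one way a GMM can improve over hard k-means assignments.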
Abstract: We propose a method that uses a latent regression Bayesian network (LRBN) to extract shared speech features as the input of an end-to-end speech recognition model. The structure of the LRBN is compact and its parameter learning is fast. Compared with a convolutional neural network, it has a simpler, more easily understood structure and fewer parameters to learn. Experimental results show the advantage of the hybrid LRBN/Bidirectional Long Short-Term Memory-Connectionist Temporal Classification architecture for Tibetan multi-dialect speech recognition, and demonstrate that the LRBN helps differentiate among multiple language speech sets.
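The BLSTM-CTC back end of such a hybrid is standard; the following sketch shows that stage only, with the LRBN front end abstracted into precomputed features. All dimensions, the vocabulary size, and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """BLSTM encoder with a CTC output head over a character/phone vocabulary."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256, vocab: int = 50):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab + 1)  # +1 for the CTC blank symbol

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(x)
        return self.proj(h).log_softmax(-1)  # (B, T, vocab+1) for nn.CTCLoss

if __name__ == "__main__":
    model = BLSTMCTC()
    x = torch.randn(4, 120, 80)                     # (batch, frames, shared features)
    logp = model(x).transpose(0, 1)                 # CTCLoss expects (T, B, C)
    targets = torch.randint(1, 51, (4, 30))         # labels 1..50; 0 is blank
    loss = nn.CTCLoss(blank=0)(
        logp, targets,
        torch.full((4,), 120, dtype=torch.long),    # input lengths
        torch.full((4,), 30, dtype=torch.long))     # target lengths
    print(loss.item())
```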
Funding: This work is supported by the National Key Research and Development Program of China under Grants 2020YFC2004003 and 2020YFC2004002, and the National Natural Science Foundation of China (NSFC) under Grant No. 61571106.
Abstract: Traditional separation methods have limited ability to handle speech separation in highly reverberant, low signal-to-noise ratio (SNR) environments and thus achieve unsatisfactory results. In this study, a convolutional neural network with temporal convolution and a residual network (TC-ResNet) is proposed to realize speech separation in a complex acoustic environment. A simplified steered-response power phase transform, denoted GSRP-PHAT, is employed to reduce the computational cost. The extracted features are reshaped into a special tensor as the system input, and temporal convolution is applied, which not only enlarges the receptive field of the convolution layer but also significantly reduces the network's computational cost. Residual blocks are used to combine multiresolution features and accelerate the training procedure. A modified ideal ratio mask is applied as the training target. Simulation results demonstrate that the proposed microphone-array speech separation algorithm based on TC-ResNet achieves better performance in terms of source-to-distortion ratio, source-to-interference ratio, and short-time objective intelligibility in low-SNR and highly reverberant environments, particularly in untrained situations, indicating that the proposed method generalizes to untrained conditions.
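A minimal sketch of the building-block pattern behind a TC-ResNet-style separator: a dilated 1-D temporal convolution wrapped in a residual connection, with stacked dilations growing the receptive field without widening the kernel. Channel counts, kernel size, and dilation schedule are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TemporalResBlock(nn.Module):
    """Dilated temporal convolution block with a residual (identity) connection."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # with kernel_size=3, padding=dilation preserves length
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))  # residual sum eases optimization

if __name__ == "__main__":
    # Doubling dilations: receptive field grows geometrically, cost stays linear.
    net = nn.Sequential(*[TemporalResBlock(32, d) for d in (1, 2, 4, 8)])
    print(net(torch.randn(2, 32, 400)).shape)  # torch.Size([2, 32, 400])
```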
Abstract: Deep learning has achieved excellent results in speech separation. Current mainstream separation models are based mainly on attention modules or convolutional neural networks; because they pass information through many intermediate states, they struggle to model long speech sequences, which degrades separation performance. This paper first proposes an end-to-end dual-path speech separation network (DPCFNet), which introduces improved densely connected blocks so that the encoder can extract rich speech features. A convolution-augmented Transformer (Conformer) is then used as the main component of the separation layer, allowing elements of the speech sequence to interact directly rather than through intermediate states. Finally, combining the Conformer with a dual-path structure enables the model to model long speech sequences effectively. Experimental results show that, compared with the mainstream Conv-Tasnet and DPTNet algorithms, the proposed model clearly improves the signal-to-distortion ratio (SDR) and scale-invariant signal-to-distortion ratio (SI-SDR), achieving better separation performance.
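The dual-path mechanism itself can be sketched briefly: the long feature sequence is cut into chunks, one module processes within each chunk (intra path), and another processes the same position across chunks (inter path), so every element can reach every other in two hops. In this minimal sketch, plain Transformer encoder layers stand in for the Conformer blocks the paper uses, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Chunked intra/inter processing over a (batch, time, dim) sequence."""
    def __init__(self, dim: int = 64, nhead: int = 4):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, x: torch.Tensor, chunk: int = 100) -> torch.Tensor:
        b, t, d = x.shape
        n = t // chunk                                  # number of whole chunks
        x = x[:, : n * chunk].reshape(b, n, chunk, d)
        # Intra path: attend within each chunk.
        x = self.intra(x.reshape(b * n, chunk, d)).reshape(b, n, chunk, d)
        # Inter path: attend across chunks at each within-chunk position.
        x = x.transpose(1, 2).reshape(b * chunk, n, d)
        x = self.inter(x).reshape(b, chunk, n, d).transpose(1, 2)
        return x.reshape(b, n * chunk, d)

if __name__ == "__main__":
    print(DualPathBlock()(torch.randn(2, 1000, 64)).shape)  # torch.Size([2, 1000, 64])
```

Because each path only ever attends over sequences of length `chunk` or `n`, the quadratic attention cost stays manageable even for very long inputs, which is the core reason dual-path models handle long sequences well.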