Objective speech quality is difficult to be measured without the input reference speech.Mapping methods using data mining are investigated and designed to improve the output-based speech quality assessment algorithm.T...Objective speech quality is difficult to be measured without the input reference speech.Mapping methods using data mining are investigated and designed to improve the output-based speech quality assessment algorithm.The degraded speech is firstly separated into three classes(unvoiced,voiced and silence),and then the consistency measurement between the degraded speech signal and the pre-trained reference model for each class is calculated and mapped to an objective speech quality score using data mining.Fuzzy Gaussian mixture model(GMM)is used to generate the artificial reference model trained on perceptual linear predictive(PLP)features.The mean opinion score(MOS)mapping methods including multivariate non-linear regression(MNLR),fuzzy neural network(FNN)and support vector regression(SVR)are designed and compared with the standard ITU-T P.563 method.Experimental results show that the assessment methods with data mining perform better than ITU-T P.563.Moreover,FNN and SVR are more efficient than MNLR,and FNN performs best with 14.50% increase in the correlation coefficient and 32.76% decrease in the root-mean-square MOS error.展开更多
Based on fuzzy Gaussian mixture model (FGMM) and support vector regression (SVR),an improved version of non-intrusive objective measurement for assessing quality of output speech without inputting clean speech is ...Based on fuzzy Gaussian mixture model (FGMM) and support vector regression (SVR),an improved version of non-intrusive objective measurement for assessing quality of output speech without inputting clean speech is proposed for narrowband speech.Its perceptual linear predictive (PLP) features extracted from clean speech and clustered by FGMM are used as an artificial reference model.Input speech is separated into three classes,for each a consistency parameter between each feature pair from test speech signals and its counterpart in the pre-trained FGMM reference model is calculated and mapped to an objective speech quality score using SVR method.The correlation degree between subjective mean opinion score (MOS) and objective MOS is analyzed.Experimental results show that the proposed method offers an effective technique and can give better performances than the ITU-T P.563 method under most of the test conditions for narrowband speech.展开更多
语音合成技术是指给定文本经过模型处理生成目标说话人语音的过程,该技术在现实社会中已经得到广泛应用。在众多的语音合成模型中,VITS(The Variational Inference for Text-to-Speech)模型将多任务损失函数进行有效组合,相比以往的模型...语音合成技术是指给定文本经过模型处理生成目标说话人语音的过程,该技术在现实社会中已经得到广泛应用。在众多的语音合成模型中,VITS(The Variational Inference for Text-to-Speech)模型将多任务损失函数进行有效组合,相比以往的模型,能够生成质量更高、听感更自然的语音。然而,现有模型依赖多个损失函数,暂时缺乏对其有效权衡的研究。因此,在现有模型损失函数的基础上,引入了梯度归一化自适应损失平衡优化方法,它根据模型不同损失函数的量级与不同子任务的训练速度来平衡各损失函数之间的权重,以验证该方法在语音合成任务中的适用性。在公开的中文语音合成数据集上评估了该方法合成语音的准确度与自然度,结果表明,采用此损失函数的模型在性能上得到了提升,证明了方法的有效性。展开更多
This paper addresses the problem of single-channel speech enhancement in the adverse environment. The critical-band rate scale based on improved multi-band spectral subtraction is investigated in this study for enhanc...This paper addresses the problem of single-channel speech enhancement in the adverse environment. The critical-band rate scale based on improved multi-band spectral subtraction is investigated in this study for enhancement of single-channel speech. In this work, the whole speech spectrum is divided into different non-uniformly spaced frequency bands in accordance with the critical-band rate scale of the psycho-acoustic model and the spectral over-subtraction is carried-out separately in each band. In addition, for the estimation of the noise from each band, the adaptive noise estimation approach is used and does not require explicit speech silence detection. The noise is estimated and updated by adaptively smoothing the noisy signal power in each band. The smoothing parameter is controlled by a-posteriori signal-to-noise ratio (SNR). For the performance analysis of the proposed algorithm, the objective measures, such as, SNR, segmental SNR, and perceptual evaluations of the speech quality are conducted for the variety of noises at different levels of SNRs. The speech spectrogram and objective evaluations of the proposed algorithm are compared with other standard speech enhancement algorithms and proved that the musical structure of the remnant noise and background noise is better suppressed by the proposed algorithm.展开更多
基金Projects(61001188,1161140319)supported by the National Natural Science Foundation of ChinaProject(2012ZX03001034)supported by the National Science and Technology Major ProjectProject(YETP1202)supported by Beijing Higher Education Young Elite Teacher Project,China
文摘Objective speech quality is difficult to be measured without the input reference speech.Mapping methods using data mining are investigated and designed to improve the output-based speech quality assessment algorithm.The degraded speech is firstly separated into three classes(unvoiced,voiced and silence),and then the consistency measurement between the degraded speech signal and the pre-trained reference model for each class is calculated and mapped to an objective speech quality score using data mining.Fuzzy Gaussian mixture model(GMM)is used to generate the artificial reference model trained on perceptual linear predictive(PLP)features.The mean opinion score(MOS)mapping methods including multivariate non-linear regression(MNLR),fuzzy neural network(FNN)and support vector regression(SVR)are designed and compared with the standard ITU-T P.563 method.Experimental results show that the assessment methods with data mining perform better than ITU-T P.563.Moreover,FNN and SVR are more efficient than MNLR,and FNN performs best with 14.50% increase in the correlation coefficient and 32.76% decrease in the root-mean-square MOS error.
文摘Based on fuzzy Gaussian mixture model (FGMM) and support vector regression (SVR),an improved version of non-intrusive objective measurement for assessing quality of output speech without inputting clean speech is proposed for narrowband speech.Its perceptual linear predictive (PLP) features extracted from clean speech and clustered by FGMM are used as an artificial reference model.Input speech is separated into three classes,for each a consistency parameter between each feature pair from test speech signals and its counterpart in the pre-trained FGMM reference model is calculated and mapped to an objective speech quality score using SVR method.The correlation degree between subjective mean opinion score (MOS) and objective MOS is analyzed.Experimental results show that the proposed method offers an effective technique and can give better performances than the ITU-T P.563 method under most of the test conditions for narrowband speech.
文摘语音合成技术是指给定文本经过模型处理生成目标说话人语音的过程,该技术在现实社会中已经得到广泛应用。在众多的语音合成模型中,VITS(The Variational Inference for Text-to-Speech)模型将多任务损失函数进行有效组合,相比以往的模型,能够生成质量更高、听感更自然的语音。然而,现有模型依赖多个损失函数,暂时缺乏对其有效权衡的研究。因此,在现有模型损失函数的基础上,引入了梯度归一化自适应损失平衡优化方法,它根据模型不同损失函数的量级与不同子任务的训练速度来平衡各损失函数之间的权重,以验证该方法在语音合成任务中的适用性。在公开的中文语音合成数据集上评估了该方法合成语音的准确度与自然度,结果表明,采用此损失函数的模型在性能上得到了提升,证明了方法的有效性。
文摘This paper addresses the problem of single-channel speech enhancement in the adverse environment. The critical-band rate scale based on improved multi-band spectral subtraction is investigated in this study for enhancement of single-channel speech. In this work, the whole speech spectrum is divided into different non-uniformly spaced frequency bands in accordance with the critical-band rate scale of the psycho-acoustic model and the spectral over-subtraction is carried-out separately in each band. In addition, for the estimation of the noise from each band, the adaptive noise estimation approach is used and does not require explicit speech silence detection. The noise is estimated and updated by adaptively smoothing the noisy signal power in each band. The smoothing parameter is controlled by a-posteriori signal-to-noise ratio (SNR). For the performance analysis of the proposed algorithm, the objective measures, such as, SNR, segmental SNR, and perceptual evaluations of the speech quality are conducted for the variety of noises at different levels of SNRs. The speech spectrogram and objective evaluations of the proposed algorithm are compared with other standard speech enhancement algorithms and proved that the musical structure of the remnant noise and background noise is better suppressed by the proposed algorithm.