Two discriminative methods for solving tone problems in Mandarin speech recognition are presented. First, discriminative training on the HMM (hidden Markov model) based tone models is proposed. Then an integration t...Two discriminative methods for solving tone problems in Mandarin speech recognition are presented. First, discriminative training on the HMM (hidden Markov model) based tone models is proposed. Then an integration technique of tone models into a large vocabulary continuous speech recognition system is presented. Discriminative model weight training based on minimum phone error criteria is adopted aiming at optimal integration of the tone models. The extended Baum Welch algorithm is applied to find the model-dependent weights to scale the acoustic scores and tone scores. Experimental results show that tone recognition rates and continuous speech recognition accuracy can be improved by the discriminatively trained tone model. Performance of a large vocabulary continuous Mandarin speech recognition system can be further enhanced by the discriminatively trained weight combinations due to a better interpolation of the given models.展开更多
This paper presents a method of tone recognition for Mandarin speech by using combination of wavelet transform and hidden Markov modeling techniques. A pitch detector based on singularity detection and multi-resolutio...This paper presents a method of tone recognition for Mandarin speech by using combination of wavelet transform and hidden Markov modeling techniques. A pitch detector based on singularity detection and multi-resolution analysis of wavelet transform is employed for estimation of pitch periods, and hidden Markov modeling with partition Gaussian mixtures probability density function is used for the tone recognition. The algorithm can provide recognition accuracy of 97.22% and 94.47% for speaker-dependent and speaker-independent tone recognition, respectively.展开更多
To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden markov model (HMM) based tone modeling. The method uses linear transforms to project Fo(fundamen...To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden markov model (HMM) based tone modeling. The method uses linear transforms to project Fo(fundamental frequency) features of neighboring syllables as compensations, and adds them to the original Fo features of the current syUable. The transforms are discriminatively trained by using an objective function termed as "minimum tone error", which is a smooth approximation of tone recognition accuracy. Experiments show that the new tonal features achieve 3.82% tone recognition rate improvement, compared with the baseline, using maximum likelihood trained HMM on the normal F0 features. Further experiments show that discriminative HMM training on the new features is 8.78% better than the baseline.展开更多
This paper describes a method for recognizing Chinese tones in continuous speech. The first and second order differentials of the fundamental frequency logarithmically converted are used as feature parameters. A left-...This paper describes a method for recognizing Chinese tones in continuous speech. The first and second order differentials of the fundamental frequency logarithmically converted are used as feature parameters. A left-to-right hidden Markov modeling with five states, each of which is modeled by a single Gaussian distribution, expresses each of Chinese tones. Non-voiced portions are coded by random values normally distributed to uniformly deal with all the time frames in an utterance. Speaker dependent tone recognition was conducted for ten speakers. The average rate of 81.8% was obtained for these speakers.展开更多
Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been em-ployed to address this ...Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been em-ployed to address this problem, and thus the performance of such methods depends on the accuracy in noise estimation. In this paper, an alternative method, using a harmonic-based spectral reconstruction algo-rithm, is proposed for the enhancement of robust automatic speech recognition. Neither noise estimation nor noise-model training are required in the proposed approach. A spectral subtraction integrated autocorrela-tion function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle- and low-signal noise ratio (SNR) ranges. The advantage of the proposed method is more manifest for non-stationary noise, as the algorithm does not require an assumption of stationary noise.展开更多
Emotion mismatch between training and testing will cause system performance decline sharply which is emotional speaker recognition. It is an important idea to solve this problem according to the emotion normalization ...Emotion mismatch between training and testing will cause system performance decline sharply which is emotional speaker recognition. It is an important idea to solve this problem according to the emotion normalization of test speech. This method proceeds from analysis of the differences between every kind of emotional speech and neutral speech. Besides, it takes the baseband mismatch of emotional changes as the main line. At the same time, it gives the corresponding algorithm according to four technical points which are emotional expansion, emotional shield, emotional normalization and score compensation. Compared with the traditional GMM-UBM method, the recognition rate in MASC corpus and EPST corpus was increased by 3.80% and 8.81% respectively.展开更多
A detection system for American English glides/w y r 1] in a knowledge-based automatic speech recognition system is presented. The method uses detection of dips in band-limited energy to total energy ratios, instead o...A detection system for American English glides/w y r 1] in a knowledge-based automatic speech recognition system is presented. The method uses detection of dips in band-limited energy to total energy ratios, instead of detecting dips along the unmodified band-limited energy contours. By using band-limited energy ratio, the dip detection is applicable in not only intervocalic regions but also in non-intervocalic regions. A Gaussian mixture model(GMM) based classifier is then used to separate the detected vowels and nasals. This approach is tested using the TIMIT corpus and results in an overall detection rate of 69.5 %, which is a 4.7 % absolute increase in detection rate compared with an hidden Markov model (HMM) based phone recognizer.展开更多
This paper presents a reliable speaker-independent method of recognizing Chinese tones. An unbiased center-clipping autocorrelation algorithm of pitch period extraction is proposed. A two-dimensional decision vector i...This paper presents a reliable speaker-independent method of recognizing Chinese tones. An unbiased center-clipping autocorrelation algorithm of pitch period extraction is proposed. A two-dimensional decision vector is used for recognizing Chinese tones by passing the pitch period sequence through the procedures of data selection, error correction, data smoothing and curve fitting. The average correct rate of tone recognition for isolated Chinese syllables is over 98%.展开更多
现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(D...现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(DP-Res2Net),提出一种基于时域波形的半监督端到端虚假语音检测方法。首先,为了解决训练数据集和测试数据集两者数据分布差异较大的问题,采用半监督学习进行领域迁移;然后,对于特征工程,直接将时域采样点输入DP-Res2Net中,增加局部的多尺度信息,并充分利用音频片段之间的依赖性;最后,输入特征经过浅层卷积模块、特征融合模块、全局平均池化模块得到嵌入张量,用来判别自然语音与虚假伪造语音。在公开可用的ASVspoof 2021 Speech Deep Fake评估集和VCC数据集上评估了所提出方法的性能,实验结果表明它的等错误率(EER)为19.97%,与官方最优基线系统相比降低了10.8%。基于时域波形的半监督端到端检测虚假语音检测方法面对未知攻击时是有效的,且具有更高的泛化能力。展开更多
文摘Two discriminative methods for solving tone problems in Mandarin speech recognition are presented. First, discriminative training on the HMM (hidden Markov model) based tone models is proposed. Then an integration technique of tone models into a large vocabulary continuous speech recognition system is presented. Discriminative model weight training based on minimum phone error criteria is adopted aiming at optimal integration of the tone models. The extended Baum Welch algorithm is applied to find the model-dependent weights to scale the acoustic scores and tone scores. Experimental results show that tone recognition rates and continuous speech recognition accuracy can be improved by the discriminatively trained tone model. Performance of a large vocabulary continuous Mandarin speech recognition system can be further enhanced by the discriminatively trained weight combinations due to a better interpolation of the given models.
基金Supported by the National Natural Science Foundatiuon of China
文摘This paper presents a method of tone recognition for Mandarin speech by using combination of wavelet transform and hidden Markov modeling techniques. A pitch detector based on singularity detection and multi-resolution analysis of wavelet transform is employed for estimation of pitch periods, and hidden Markov modeling with partition Gaussian mixtures probability density function is used for the tone recognition. The algorithm can provide recognition accuracy of 97.22% and 94.47% for speaker-dependent and speaker-independent tone recognition, respectively.
文摘To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden markov model (HMM) based tone modeling. The method uses linear transforms to project Fo(fundamental frequency) features of neighboring syllables as compensations, and adds them to the original Fo features of the current syUable. The transforms are discriminatively trained by using an objective function termed as "minimum tone error", which is a smooth approximation of tone recognition accuracy. Experiments show that the new tonal features achieve 3.82% tone recognition rate improvement, compared with the baseline, using maximum likelihood trained HMM on the normal F0 features. Further experiments show that discriminative HMM training on the new features is 8.78% better than the baseline.
文摘This paper describes a method for recognizing Chinese tones in continuous speech. The first and second order differentials of the fundamental frequency logarithmically converted are used as feature parameters. A left-to-right hidden Markov modeling with five states, each of which is modeled by a single Gaussian distribution, expresses each of Chinese tones. Non-voiced portions are coded by random values normally distributed to uniformly deal with all the time frames in an utterance. Speaker dependent tone recognition was conducted for ten speakers. The average rate of 81.8% was obtained for these speakers.
基金Supported by the National Natural Science Foundation of China (No. 60072011)
文摘Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been em-ployed to address this problem, and thus the performance of such methods depends on the accuracy in noise estimation. In this paper, an alternative method, using a harmonic-based spectral reconstruction algo-rithm, is proposed for the enhancement of robust automatic speech recognition. Neither noise estimation nor noise-model training are required in the proposed approach. A spectral subtraction integrated autocorrela-tion function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle- and low-signal noise ratio (SNR) ranges. The advantage of the proposed method is more manifest for non-stationary noise, as the algorithm does not require an assumption of stationary noise.
文摘Emotion mismatch between training and testing will cause system performance decline sharply which is emotional speaker recognition. It is an important idea to solve this problem according to the emotion normalization of test speech. This method proceeds from analysis of the differences between every kind of emotional speech and neutral speech. Besides, it takes the baseband mismatch of emotional changes as the main line. At the same time, it gives the corresponding algorithm according to four technical points which are emotional expansion, emotional shield, emotional normalization and score compensation. Compared with the traditional GMM-UBM method, the recognition rate in MASC corpus and EPST corpus was increased by 3.80% and 8.81% respectively.
基金The Ministry of Knowledge Economy,Korea,under the Infor mation Technology Research Center support program supervised by the National IT Industry Promotion Agency(NIPA-2012-H0301-12-2006)
文摘A detection system for American English glides/w y r 1] in a knowledge-based automatic speech recognition system is presented. The method uses detection of dips in band-limited energy to total energy ratios, instead of detecting dips along the unmodified band-limited energy contours. By using band-limited energy ratio, the dip detection is applicable in not only intervocalic regions but also in non-intervocalic regions. A Gaussian mixture model(GMM) based classifier is then used to separate the detected vowels and nasals. This approach is tested using the TIMIT corpus and results in an overall detection rate of 69.5 %, which is a 4.7 % absolute increase in detection rate compared with an hidden Markov model (HMM) based phone recognizer.
基金The Project is Supported by the National Natural Science Foundation of China
文摘This paper presents a reliable speaker-independent method of recognizing Chinese tones. An unbiased center-clipping autocorrelation algorithm of pitch period extraction is proposed. A two-dimensional decision vector is used for recognizing Chinese tones by passing the pitch period sequence through the procedures of data selection, error correction, data smoothing and curve fitting. The average correct rate of tone recognition for isolated Chinese syllables is over 98%.
文摘现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(DP-Res2Net),提出一种基于时域波形的半监督端到端虚假语音检测方法。首先,为了解决训练数据集和测试数据集两者数据分布差异较大的问题,采用半监督学习进行领域迁移;然后,对于特征工程,直接将时域采样点输入DP-Res2Net中,增加局部的多尺度信息,并充分利用音频片段之间的依赖性;最后,输入特征经过浅层卷积模块、特征融合模块、全局平均池化模块得到嵌入张量,用来判别自然语音与虚假伪造语音。在公开可用的ASVspoof 2021 Speech Deep Fake评估集和VCC数据集上评估了所提出方法的性能,实验结果表明它的等错误率(EER)为19.97%,与官方最优基线系统相比降低了10.8%。基于时域波形的半监督端到端检测虚假语音检测方法面对未知攻击时是有效的,且具有更高的泛化能力。