This paper proposes a new phase feature derived from the formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using Hilbert transform (HT), the formant chara...This paper proposes a new phase feature derived from the formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using Hilbert transform (HT), the formant characteristics can be represented by instantaneous frequency (IF) and instantaneous bandwidth, namely formant instantaneous characteristics (FIC). In order to explore the importance of FIC both in SR and SI, this paper proposes different features from FIC used for SR and SI systems. When combing these new features with conventional parameters, higher identification rate can be achieved than that of using Mel-frequency cepstral coefficients (MFCC) parameters only. The experiment results show that the new features are effective characteristic parameters and can be treated as the compensation of conventional parameters for SR and SI.展开更多
A speaker adaptation method that combines transformation matrix linear interpolation with maximum a posteriori (MAP) was proposed. Firstly this method can keep the asymptotical characteristic of MAP. Secondly, as the ...A speaker adaptation method that combines transformation matrix linear interpolation with maximum a posteriori (MAP) was proposed. Firstly this method can keep the asymptotical characteristic of MAP. Secondly, as the method uses linear interpolation with several speaker-dependent (SD) transformation matrixes, it can fully use the prior knowledge and keep fast adaptation. The experimental results show that the combined method achieves an 8.24% word error rate reduction with only one adaptation utterance, and keeps asymptotic to the performance of SD model for large amounts of adaptation data.展开更多
A VQ based efficient speech recognition method is introduced, and the key parameters of this method are comparatively studied. This method is especially designed for mandarin speaker dependent small size word set r...A VQ based efficient speech recognition method is introduced, and the key parameters of this method are comparatively studied. This method is especially designed for mandarin speaker dependent small size word set recognition. It has less complexity, less resource consumption but higher ARR (accurate recognition rate) compared with traditional HMM or NN approach. A large scale test on the task of 11 mandarin digits recognition shows that the WER(word error rate) can reach 3 86%. This method is suitable for being embedded in PDA (personal digital assistant), mobile phone and so on to perform voice controlling like digits dialing, name dialing, calculating, voice commanding, etc.展开更多
Perceptual auditory filter banks such as Bark-scale filter bank are widely used as front-end processing in speech recognition systems.However,the problem of the design of optimized filter banks that provide higher acc...Perceptual auditory filter banks such as Bark-scale filter bank are widely used as front-end processing in speech recognition systems.However,the problem of the design of optimized filter banks that provide higher accuracy in recognition tasks is still open.Owing to spectral analysis in feature extraction,an adaptive bands filter bank (ABFB) is presented.The design adopts flexible bandwidths and center frequencies for the frequency responses of the filters and utilizes genetic algorithm (GA) to optimize the design parameters.The optimization process is realized by combining the front-end filter bank with the back-end recognition network in the performance evaluation loop.The deployment of ABFB together with zero-crossing peak amplitude (ZCPA) feature as a front process for radial basis function (RBF) system shows significant improvement in robustness compared with the Bark-scale filter bank.In ABFB,several sub-bands are still more concentrated toward lower frequency but their exact locations are determined by the performance rather than the perceptual criteria.For the ease of optimization,only symmetrical bands are considered here,which still provide satisfactory results.展开更多
A transformation matrix linear interpolation (TMLI) approach for speaker adaptation is proposed. TMLI uses the transformation matrixes produced by MLLR from selected training speakers and the testing speaker. With onl...A transformation matrix linear interpolation (TMLI) approach for speaker adaptation is proposed. TMLI uses the transformation matrixes produced by MLLR from selected training speakers and the testing speaker. With only 3 adaptation sentences, the performance shows a 12.12% word error rate reduction. As the number of adaptation sentences increases, the performance saturates quickly. To improve the behavior of TMLI for large amounts of adaptation data, the TMLI+MAP method which combines TMLI with MAP technique is proposed. Experimental results show TMLI+MAP achieved better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data. Key words speech recognition - speaker adaptation - MLLR - MAP - maximum likelihood model interpolation (MLMI) CLC number TN 912. 34 Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033)Biography: XU Xiang-hua (1977-), female, Ph. D. candidate, research direction: large vocabulary continuous Mandarin speech recognition and speaker adaptation展开更多
现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(D...现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(DP-Res2Net),提出一种基于时域波形的半监督端到端虚假语音检测方法。首先,为了解决训练数据集和测试数据集两者数据分布差异较大的问题,采用半监督学习进行领域迁移;然后,对于特征工程,直接将时域采样点输入DP-Res2Net中,增加局部的多尺度信息,并充分利用音频片段之间的依赖性;最后,输入特征经过浅层卷积模块、特征融合模块、全局平均池化模块得到嵌入张量,用来判别自然语音与虚假伪造语音。在公开可用的ASVspoof 2021 Speech Deep Fake评估集和VCC数据集上评估了所提出方法的性能,实验结果表明它的等错误率(EER)为19.97%,与官方最优基线系统相比降低了10.8%。基于时域波形的半监督端到端检测虚假语音检测方法面对未知攻击时是有效的,且具有更高的泛化能力。展开更多
Functional paralanguage includes considerable emotion information, and it is insensitive to speaker changes. To improve the emotion recognition accuracy under the condition of speaker-independence, a fusion method com...Functional paralanguage includes considerable emotion information, and it is insensitive to speaker changes. To improve the emotion recognition accuracy under the condition of speaker-independence, a fusion method combining the functional paralanguage features with the accompanying paralanguage features is proposed for the speaker-independent speech emotion recognition. Using this method, the functional paralanguages, such as laughter, cry, and sigh, are used to assist speech emotion recognition. The contributions of our work are threefold. First, one emotional speech database including six kinds of functional paralanguage and six typical emotions were recorded by our research group. Second, the functional paralanguage is put forward to recognize the speech emotions combined with the accompanying paralanguage features. Third, a fusion algorithm based on confidences and probabilities is proposed to combine the functional paralanguage features with the accompanying paralanguage features for speech emotion recognition. We evaluate the usefulness of the functional paralanguage features and the fusion algorithm in terms of precision, recall, and F1-measurement on the emotional speech database recorded by our research group. The overall recognition accuracy achieved for six emotions is over 67% in the speaker-independent condition using the functional paralanguage features.展开更多
基金Project supported by the National Natural Science Foundation of China (Grant No.60903186)the Shanghai Leading Academic Discipline Project (Grant No.J50104)
文摘This paper proposes a new phase feature derived from the formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using Hilbert transform (HT), the formant characteristics can be represented by instantaneous frequency (IF) and instantaneous bandwidth, namely formant instantaneous characteristics (FIC). In order to explore the importance of FIC both in SR and SI, this paper proposes different features from FIC used for SR and SI systems. When combing these new features with conventional parameters, higher identification rate can be achieved than that of using Mel-frequency cepstral coefficients (MFCC) parameters only. The experiment results show that the new features are effective characteristic parameters and can be treated as the compensation of conventional parameters for SR and SI.
文摘A speaker adaptation method that combines transformation matrix linear interpolation with maximum a posteriori (MAP) was proposed. Firstly this method can keep the asymptotical characteristic of MAP. Secondly, as the method uses linear interpolation with several speaker-dependent (SD) transformation matrixes, it can fully use the prior knowledge and keep fast adaptation. The experimental results show that the combined method achieves an 8.24% word error rate reduction with only one adaptation utterance, and keeps asymptotic to the performance of SD model for large amounts of adaptation data.
文摘A VQ based efficient speech recognition method is introduced, and the key parameters of this method are comparatively studied. This method is especially designed for mandarin speaker dependent small size word set recognition. It has less complexity, less resource consumption but higher ARR (accurate recognition rate) compared with traditional HMM or NN approach. A large scale test on the task of 11 mandarin digits recognition shows that the WER(word error rate) can reach 3 86%. This method is suitable for being embedded in PDA (personal digital assistant), mobile phone and so on to perform voice controlling like digits dialing, name dialing, calculating, voice commanding, etc.
基金Project(61072087) supported by the National Natural Science Foundation of ChinaProject(20093048) supported by Shanxi ProvincialGraduate Innovation Fund of China
文摘Perceptual auditory filter banks such as Bark-scale filter bank are widely used as front-end processing in speech recognition systems.However,the problem of the design of optimized filter banks that provide higher accuracy in recognition tasks is still open.Owing to spectral analysis in feature extraction,an adaptive bands filter bank (ABFB) is presented.The design adopts flexible bandwidths and center frequencies for the frequency responses of the filters and utilizes genetic algorithm (GA) to optimize the design parameters.The optimization process is realized by combining the front-end filter bank with the back-end recognition network in the performance evaluation loop.The deployment of ABFB together with zero-crossing peak amplitude (ZCPA) feature as a front process for radial basis function (RBF) system shows significant improvement in robustness compared with the Bark-scale filter bank.In ABFB,several sub-bands are still more concentrated toward lower frequency but their exact locations are determined by the performance rather than the perceptual criteria.For the ease of optimization,only symmetrical bands are considered here,which still provide satisfactory results.
文摘A transformation matrix linear interpolation (TMLI) approach for speaker adaptation is proposed. TMLI uses the transformation matrixes produced by MLLR from selected training speakers and the testing speaker. With only 3 adaptation sentences, the performance shows a 12.12% word error rate reduction. As the number of adaptation sentences increases, the performance saturates quickly. To improve the behavior of TMLI for large amounts of adaptation data, the TMLI+MAP method which combines TMLI with MAP technique is proposed. Experimental results show TMLI+MAP achieved better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data. Key words speech recognition - speaker adaptation - MLLR - MAP - maximum likelihood model interpolation (MLMI) CLC number TN 912. 34 Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033)Biography: XU Xiang-hua (1977-), female, Ph. D. candidate, research direction: large vocabulary continuous Mandarin speech recognition and speaker adaptation
文摘现代语音合成和音色转换系统产生的虚假语音对自动说话人识别系统构成了严重威胁。大多数现有的虚假语音检测系统对在训练中已知的攻击类型表现良好,但对实际应用中的未知攻击类型检测效果显著降低。因此,结合最近提出的双路径Res2Net(DP-Res2Net),提出一种基于时域波形的半监督端到端虚假语音检测方法。首先,为了解决训练数据集和测试数据集两者数据分布差异较大的问题,采用半监督学习进行领域迁移;然后,对于特征工程,直接将时域采样点输入DP-Res2Net中,增加局部的多尺度信息,并充分利用音频片段之间的依赖性;最后,输入特征经过浅层卷积模块、特征融合模块、全局平均池化模块得到嵌入张量,用来判别自然语音与虚假伪造语音。在公开可用的ASVspoof 2021 Speech Deep Fake评估集和VCC数据集上评估了所提出方法的性能,实验结果表明它的等错误率(EER)为19.97%,与官方最优基线系统相比降低了10.8%。基于时域波形的半监督端到端检测虚假语音检测方法面对未知攻击时是有效的,且具有更高的泛化能力。
基金supported by the National Natural Science Foundation of China (Nos. 61272211 and 61170126)the Natural Science Foundation of Jiangsu Province (No. BK2011521)the Research Foundation for Talented Scholars of Jiangsu University (No. 10JDG065), China
文摘Functional paralanguage includes considerable emotion information, and it is insensitive to speaker changes. To improve the emotion recognition accuracy under the condition of speaker-independence, a fusion method combining the functional paralanguage features with the accompanying paralanguage features is proposed for the speaker-independent speech emotion recognition. Using this method, the functional paralanguages, such as laughter, cry, and sigh, are used to assist speech emotion recognition. The contributions of our work are threefold. First, one emotional speech database including six kinds of functional paralanguage and six typical emotions were recorded by our research group. Second, the functional paralanguage is put forward to recognize the speech emotions combined with the accompanying paralanguage features. Third, a fusion algorithm based on confidences and probabilities is proposed to combine the functional paralanguage features with the accompanying paralanguage features for speech emotion recognition. We evaluate the usefulness of the functional paralanguage features and the fusion algorithm in terms of precision, recall, and F1-measurement on the emotional speech database recorded by our research group. The overall recognition accuracy achieved for six emotions is over 67% in the speaker-independent condition using the functional paralanguage features.