The performance of speaker verification systems is often compromised in real-world environments. For example, variations in handset characteristics can cause severe performance degradation. This paper presents a novel method to overcome this problem by using a non-linear handset mapper. Under this method, a mapper is constructed by training an elliptical basis function network using distorted speech features as inputs and the corresponding clean features as the desired outputs. During feature recuperation, clean features are recovered by feeding the distorted features to the feature mapper. The recovered features are then presented to a speaker model as if they were derived from clean speech. Experimental evaluations based on 258 speakers of the TIMIT and NTIMIT corpora suggest that the feature mappers improve the verification performance remarkably.
The first step of missing-feature methods in text-independent speaker identification is to identify highly corrupted spectrographic representations of speech as missing features. Most mask estimation techniques rely on explicit estimation of the characteristics of the corrupting noise and usually fail when the noise estimate is inaccurate. We present a mask estimation technique that uses neural networks to determine the reliability of spectrographic elements. Without any prior knowledge of the noise or prior probability of speech, this method exploits only the characteristics of the speech signal. Experiments were performed on speech corrupted by stationary F16 noise and non-stationary babble noise from 5 dB to 20 dB separately, using the cluster-based reconstruction missing-feature method. The proposed technique yields better recognition accuracy than conventional spectral-subtraction mask estimation methods.
Funding: The Major Key Project of PCL, Grant/Award Number: PCL2022A03; National Natural Science Foundation of China, Grant/Award Numbers: 61976064, 62372137; Zhejiang Provincial Natural Science Foundation of China, Grant/Award Number: LZ22F020007.
Abstract: Adversarial attacks pose significant security concerns to intelligent systems such as speaker recognition systems (SRSs). Most attacks assume that the neural networks in the systems are known beforehand, while black-box attacks are proposed without such information to match practical situations. Existing black-box attacks improve transferability by integrating multiple models or training on multiple datasets, but these methods are costly. Motivated by the optimisation strategy with spatial information on the perturbed paths and samples, we propose a Dual Spatial Momentum Iterative Fast Gradient Sign Method (DS-MI-FGSM) to improve the transferability of black-box attacks against SRSs. Specifically, DS-MI-FGSM needs only a single data sample and one model as input; by extending to the data and model neighbouring spaces, it generates adversarial examples against the integrated models. To reduce the risk of overfitting, DS-MI-FGSM also introduces gradient masking to improve transferability. We conduct extensive experiments on the speaker recognition task, and the results demonstrate the effectiveness of the method, which achieves up to a 92% attack success rate on the victim model in black-box scenarios with only one known model.
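As a rough illustration of the baseline that DS-MI-FGSM builds on, the following sketch shows the plain momentum iterative FGSM update: accumulate an L1-normalised gradient with a decay factor, step along its sign, and clip to an epsilon-ball. It does not implement the dual-space extension or the gradient masking described in the abstract. The names `mi_fgsm` and `grad_fn`, the toy linear loss, and all parameter values are assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mi_fgsm(x, grad_fn, eps=0.1, steps=10, mu=1.0):
    """Baseline MI-FGSM (hypothetical helper, not the DS variant).

    x       -- clean input as a list of floats
    grad_fn -- function returning the loss gradient at a given input
    eps     -- maximum per-coordinate perturbation (L-infinity bound)
    mu      -- momentum decay factor
    """
    alpha = eps / steps          # per-iteration step size
    g = [0.0] * len(x)           # accumulated momentum
    adv = list(x)
    for _ in range(steps):
        grad = grad_fn(adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # avoid division by zero
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]
        adv = [a + alpha * sign(gi) for a, gi in zip(adv, g)]
        # Keep the adversarial example inside the eps-ball around x.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv


# Toy loss L(x) = w . x, whose gradient is the constant vector w.
w = [0.5, -2.0, 0.0]
x = [0.0, 0.0, 0.0]
adv = mi_fgsm(x, lambda z: w)
```

With this toy loss every coordinate moves to the edge of the epsilon-ball in the direction of the gradient sign, and the coordinate with zero gradient stays unchanged.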
Funding: The National Natural Science Foundation of China (No. 60872073, 60975017, 51075068); the Natural Science Foundation of Guangdong Province (No. 10252800001000001); the Natural Science Foundation of Jiangsu Province (No. BK2010546).
Abstract: A novel emotional speaker recognition system (ESRS) is proposed to compensate for emotion variability. First, emotion recognition is adopted as a pre-processing step to classify speech as neutral or emotional. Then, the speech recognized as emotional is adjusted by prosody modification. Different methods, including Gaussian normalization, the Gaussian mixture model (GMM) and support vector regression (SVR), are adopted to define the mapping rules of F0 between emotional and neutral speech, and the average linear ratio is used for the duration modification. Finally, the modified emotional speech is used for speaker recognition. The experimental results show that the proposed ESRS can significantly improve the performance of emotional speaker recognition, and that its identification rate (IR) is higher than that of the traditional recognition system. The emotional speech with F0 and duration modifications is closer to the neutral speech.
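The simplest of the three F0 mapping rules named above, Gaussian normalization, can be sketched as a mean and variance match: each emotional F0 value is standardised with the emotional statistics and rescaled with the neutral ones. The abstract does not give the exact formulation, so the function name, the use of population statistics and the sample values below are assumptions.

```python
from statistics import mean, pstdev


def gaussian_normalize_f0(f0_emotional, neutral_mean, neutral_std):
    """Map emotional F0 values (Hz, voiced frames only) towards neutral statistics.

    Hypothetical helper: shifts and scales the contour so that its mean and
    standard deviation match the given neutral-speech values.
    """
    mu_e = mean(f0_emotional)
    sd_e = pstdev(f0_emotional)
    if sd_e == 0:
        # A flat contour carries no variance to rescale.
        return [neutral_mean for _ in f0_emotional]
    return [neutral_mean + (f - mu_e) * neutral_std / sd_e for f in f0_emotional]


# Illustrative values only, not taken from the paper.
converted = gaussian_normalize_f0([200.0, 220.0, 240.0], 120.0, 10.0)
```

After the mapping, the converted contour has the neutral mean and standard deviation while keeping the shape of the original contour.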
Abstract: This paper argues that in the age of 'World Englishes', it is not necessary to differentiate native speaker teachers from non-native speaker teachers. It concludes that non-native speaker teachers can be as effective as their native colleagues and have an equal chance of professional success, even though native speaker teachers have great advantages over non-native teachers in some respects. It is time for employers, as well as ELT professionals, to look past the glaring differences between native speaker teachers and non-native speaker teachers and make the best use of such unique resources.
Abstract: The target of much language teaching and learning is to make students approximate native speakers, and the only rightful speakers of a language are held to be its native speakers. Contrary to these contemporary views, this paper argues that the obligation of the language teacher is to help students use the L2 effectively, not simply to imitate native speakers. A successful L2 user who comes from the group of L2 learners can be a model for students. Therefore, non-native teachers with a high degree of language proficiency and good teaching skills can be ideal and qualified language teachers.
Abstract: This paper analyses the author's escapism in Composed Upon Westminster Bridge, Sept. 3, 1802, and concludes that the poet wrote this particular poem for his beloved sister, whose calmness and complacency became the best medicine for the poet's wounds.
Abstract: In an audio stream containing multiple speakers, speaker diarization helps to determine "who spoke when". This is an unsupervised task, as there is no prior information about the speakers. It labels the speech signal according to the identity of the speaker; that is, the input audio stream is partitioned into homogeneous segments. In this work, we present a novel speaker diarization system that uses the Tangent weighted Mel frequency cepstral coefficient (TMFCC) as the feature parameter and the Lion algorithm to cluster the voice-activity-detected audio streams into particular speaker groups. Thus the two main tasks of speaker indexing, i.e., speaker segmentation and speaker clustering, are improved. The TMFCC makes more effective use of both low-energy and high-energy frames, improving the performance of the proposed system. Experiments are carried out on audio signals from the ELSDSR corpus with three, four and five speakers. The proposed system is evaluated using tracking distance and tracking time as metrics, and the experimental results show that the diarization system with TMFCC parameterization and Lion-based clustering outperforms existing diarization systems, with 95% tracking accuracy.
Abstract: A transformation matrix linear interpolation (TMLI) approach for speaker adaptation is proposed. TMLI uses the transformation matrices produced by MLLR from selected training speakers and the testing speaker. With only 3 adaptation sentences, the performance shows a 12.12% word error rate reduction. As the number of adaptation sentences increases, the performance saturates quickly. To improve the behavior of TMLI for large amounts of adaptation data, the TMLI+MAP method, which combines TMLI with the MAP technique, is proposed. Experimental results show that TMLI+MAP achieves better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data. Key words: speech recognition; speaker adaptation; MLLR; MAP; maximum likelihood model interpolation (MLMI). CLC number: TN912.34. Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033). Biography: XU Xiang-hua (1977-), female, Ph.D. candidate, research direction: large vocabulary continuous Mandarin speech recognition and speaker adaptation.
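The interpolation step at the core of TMLI can be pictured as a weighted average of transformation matrices. The sketch below only shows that arithmetic; how the paper selects training speakers or estimates the interpolation weights is not stated in the abstract, so the function name, the weight normalisation and the 2x2 matrices are assumptions.

```python
def interpolate_transforms(matrices, weights):
    """Return the weighted average of equally sized matrices (lists of rows).

    Hypothetical helper: weights are normalised to sum to 1 before use.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(wk * m[i][j] for wk, m in zip(norm, matrices)) for j in range(cols)]
        for i in range(rows)
    ]


# Two illustrative "speaker" transforms; real MLLR matrices are much larger.
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
combined = interpolate_transforms([identity, scaled], [3.0, 1.0])
```

Here the first matrix receives weight 0.75 and the second 0.25, so the diagonal of the result is 1.25.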
Abstract: This paper reports part of the findings of a large-scale study exploring the viewpoints of Chinese ELT stakeholders (students, teachers and administrators) on native speakerism, in order to find out whether current EFL education in China is still affected by this chauvinistic ideology. The analysis of data through a critical lens reveals that the vast majority of the participants conferred upon NS products (teacher, language, culture and teaching methodology) a status superior to that granted to their NNS counterparts, and failed to see linguacultural and epistemological inequalities between the English-speaking West and traditional NNS countries, inter alia, China. These findings suggest that the three participant groups as a whole succumb to native speakerism, and by extension that ELT in China is still haunted to a great degree by this ideology. Given that this study treats each participant group separately, future studies are expected to explore inter-group interactions in ideology.
Abstract: The aim of this paper is to report the accuracy and timing results of a text-independent automatic speaker recognition (ASR) system based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), with a view to developing a security access control gate. 450 speakers were randomly extracted from the Voxforge.org audio database. Their utterances were enhanced using spectral subtraction, then MFCCs were extracted and statistically modeled by GMMs in order to build each speaker profile. For each speaker, two different speech files were used: the first to build the profile database, the second to test the system performance. The accuracy achieved by the proposed approach is greater than 96%, and the time spent on a single test run, implemented in Matlab, is about 2 seconds on a common PC.
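The matching stage of an MFCC and GMM system of this kind can be sketched as follows: each enrolled speaker is represented by a mixture model, and a test utterance is assigned to the profile with the highest average frame log-likelihood. This is a minimal sketch assuming diagonal covariances and already-trained models; the speaker names, the two-dimensional features and all parameter values are invented for illustration and are not from the paper.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def identify(frames, profiles):
    """Return the name of the profile that best explains the frames."""
    return max(profiles, key=lambda name: gmm_log_likelihood(frames, profiles[name]))


# Hypothetical pre-trained profiles over 2-dimensional features.
profiles = {
    "alice": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "bob": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
test_frames = [[4.8, 5.1], [5.2, 4.9]]
best = identify(test_frames, profiles)
```

In a real system the frames would be MFCC vectors and the mixtures would be trained with expectation-maximisation on the enrolment file of each speaker.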
Abstract: In this paper, a manifold subspace learning algorithm based on locality preserving discriminant projection (LPDP) is used for speaker verification. LPDP overcomes the deficiencies of total variability factor analysis and locality preserving projection (LPP) by making effective use of the speaker label information of the speech data. Through optimization, LPDP maintains the inherent local manifold structure of the speech data samples of the same speaker by reducing the distance between them. At the same time, LPDP enhances the discriminability of the embedding space by expanding the distance between the speech data samples of different speakers. The proposed method is compared with LPP and total variability factor analysis on the NIST SRE 2010 telephone-telephone core condition. The experimental results indicate that the proposed LPDP overcomes the deficiencies of LPP and total variability factor analysis and further improves the system performance.
Funding: This work was supported by the GRRC program of Gyeonggi province [GRRC-Gachon2020(B04), Development of AI-based Healthcare Devices].
Abstract: Automatic speaker recognition (ASR) systems belong to the field of human-machine interaction, and scientists have been using feature extraction and feature matching methods to analyze and synthesize speech signals. One of the most commonly used methods for feature extraction is Mel Frequency Cepstral Coefficients (MFCCs). Recent research shows that MFCCs process the voice signal with high accuracy. MFCCs represent a sequence of voice-signal-specific features. This experimental analysis aims to distinguish Turkish speakers by extracting MFCCs from speech recordings. Since the human perception of sound is not linear, after the filterbank step in the MFCC method, we converted the obtained log filterbanks into decibel (dB) feature-based spectrograms without applying the Discrete Cosine Transform (DCT). A new dataset was created by converting each spectrogram into a 2-D array. Several learning algorithms were implemented with 10-fold cross-validation to detect the speaker. The highest accuracy, 90.2%, was achieved using a Multi-layer Perceptron (MLP) with the tanh activation function. The most important output of this study is the inclusion of the human voice as a new feature set.
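The conversion of filterbank energies to decibels mentioned above can be sketched in a few lines. The version below assumes linear (power-domain) filterbank energies as input and uses the usual 10 log10 rule with a small floor to avoid taking the logarithm of zero; the function name, reference level and floor are assumptions, and the abstract does not specify the exact scaling the authors used.

```python
import math


def power_to_db(energies, ref=1.0, floor=1e-10):
    """Convert linear filterbank energies of one frame to decibels.

    Hypothetical helper: values below `floor` are clamped before the log.
    """
    return [10.0 * math.log10(max(e, floor) / ref) for e in energies]


# One illustrative frame of three filterbank energies.
frame_db = power_to_db([1.0, 10.0, 0.01])
```

Stacking such dB frames over time gives the 2-D array (filterbank channel by frame) that the learning algorithms consume in place of DCT-based cepstral coefficients.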
Abstract: Public speaking is a part of communication. Good public speaking can convey clear, persuasive ideas or opinions and can also become an effective bridge between the audience and the speaker. This paper deals with some skills of public speaking, considering several aspects that deserve attention when speaking in public.
Abstract: While the majority of non-native speaker English teachers teach alongside NS teachers, research on the role of native speaker English teachers in China's teaching context and on the attitudes of university students towards them has rarely been conducted. This essay discusses the implications of cultural differences for the language classroom, and the different cultures of learning with regard to language teaching and learning in China and the West. The conclusion suggests that a good sense of cultural awareness and an open mind for cultural interactions are of great importance, in order to benefit both language learners and native speaker teachers in the cross-cultural classroom.
Abstract: This paper discusses the application of fractal dimensions to speech processing. Generalized dimensions of arbitrary orders and associated fractal parameters are used in speaker identification. A characteristic vector based on these parameters is formed, and a recognition criterion is defined in order to identify individual speakers. Experimental results show the usefulness of fractal dimensions in characterizing speaker identity.
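For reference, the generalized (Rényi) dimensions of order q are commonly defined as below, where the signal's support is covered by boxes of size ε and p_i(ε) is the fraction of points falling in box i. The abstract does not state which definition or estimation procedure the paper uses, so this standard form is given as an assumption.

```latex
D_q \;=\; \frac{1}{q-1}\,\lim_{\varepsilon \to 0}
      \frac{\log \sum_{i} p_i(\varepsilon)^{q}}{\log \varepsilon},
\qquad q \neq 1 .
```

Under this definition, q = 0 gives the box-counting dimension, the limit q → 1 gives the information dimension, and q = 2 gives the correlation dimension, so a vector of D_q values over several orders is one natural way to build a characteristic vector of the kind described.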
Abstract: This paper presents a speaker-adaptable very low bit rate speech coder based on the HMM (Hidden Markov Model) that includes the dynamic features of speech, i.e., the delta and delta-delta parameters. The performance of this speech coder is improved by using the dynamic features generated by an algorithm for speech parameter generation from HMMs, because the generated speech parameter vectors reflect not only the means of the static and dynamic feature vectors but also their covariances. The encoder is equivalent to an HMM-based phoneme recognizer and transmits phoneme indexes, state durations, pitch information and speaker characteristics adaptation vectors to the decoder. The decoder receives those messages and concatenates the phoneme HMM sequence according to the phoneme indexes. The decoder then generates a sequence of mel-cepstral coefficient vectors using the HMM-based speech parameter generation technique. Finally, the decoder synthesizes speech by directly exciting the MLSA (Mel Log Spectrum Approximation) filter with the generated mel-cepstral coefficient vectors, according to the pitch information.
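The delta and delta-delta parameters mentioned in this abstract are commonly computed with a regression formula over neighbouring frames. The sketch below shows that widely used formula; the window width of 2 frames, the edge padding, and the function name `delta` are assumptions, as the abstract does not specify how the dynamic features were derived.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based dynamic features for a (frames, coefficients) array.

    delta_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    Frames beyond the edges are replaced by the nearest edge frame (an assumption).
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denominator = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # c_{t+n}
        behind = padded[width - n : width - n + n_frames]   # c_{t-n}
        out += n * (ahead - behind)
    return out / denominator

# A linear ramp stands in for one static cepstral coefficient over 10 frames.
static = np.arange(10, dtype=float).reshape(-1, 1)
d1 = delta(static)   # delta parameters
d2 = delta(d1)       # delta-delta parameters
```

Away from the padded edges, the ramp has a constant slope, so its delta is constant and its delta-delta vanishes, which is a quick sanity check on the formula.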
Abstract: This paper examines whether Chinese native speakers (CNSs) have difficulties in understanding English counterfactuals, whether CNSs have counterfactual reasoning problems in their own language, what the causes of these difficulties may be, and what problems arise in teaching English subjunctives. It also proposes ways to improve CNSs' comprehension of English counterfactuals.
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60903186) and the Shanghai Leading Academic Discipline Project (Grant No. J50104).
Abstract: This paper proposes a new phase feature derived from the formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using the Hilbert transform (HT), the formant characteristics can be represented by the instantaneous frequency (IF) and instantaneous bandwidth, together called the formant instantaneous characteristics (FIC). To explore the importance of FIC in both SR and SI, this paper derives different features from FIC for SR and SI systems. When these new features are combined with conventional parameters, a higher identification rate is achieved than with Mel-frequency cepstral coefficient (MFCC) parameters alone. The experimental results show that the new features are effective characteristic parameters and can complement conventional parameters for SR and SI.
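The instantaneous frequency obtained through the Hilbert transform can be illustrated on a single band-limited component. This is a minimal sketch only: the abstract does not describe how formants are isolated or how the instantaneous bandwidth is defined, so the code assumes an already band-passed signal and computes the IF alone. The function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal x + j*H{x}, built by zeroing negative frequencies in the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_frequency(x, sample_rate):
    """Instantaneous frequency in Hz: derivative of the unwrapped analytic phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * sample_rate / (2.0 * np.pi)

# A pure 1000 Hz tone stands in for one band-passed formant component.
sample_rate = 8000
t = np.arange(800) / sample_rate
tone = np.cos(2.0 * np.pi * 1000.0 * t)
inst_freq = instantaneous_frequency(tone, sample_rate)
envelope = np.abs(analytic_signal(tone))
```

For real speech, each formant would first have to be separated (for example with a band-pass filter around the formant), because the phase of a multi-component signal does not yield a meaningful single frequency track.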
Abstract: The performance of speaker verification systems is often compromised in real-world environments. For example, variations in handset characteristics can cause severe performance degradation. This paper presents a novel method to overcome this problem by using a nonlinear handset mapper. Under this method, a mapper is constructed by training an elliptical basis function network using distorted speech features as inputs and the corresponding clean features as the desired outputs. During feature recuperation, clean features are recovered by feeding the distorted features to the feature mapper. The recovered features are then presented to a speaker model as if they were derived from clean speech. Experimental evaluations based on 258 speakers of the TIMIT and NTIMIT corpora suggest that the feature mappers improve verification performance remarkably.
Abstract: The first step of missing feature methods in text-independent speaker identification is to identify highly corrupted regions of the spectrographic representation of speech as missing features. Most mask estimation techniques rely on explicit estimation of the characteristics of the corrupting noise and usually fail when the noise estimate is inaccurate. We present a mask estimation technique that uses neural networks to determine the reliability of spectrographic elements. Without any prior knowledge of the noise or prior probability of speech, this method exploits only the characteristics of the speech signal. Experiments were performed on speech corrupted by stationary F16 noise and non-stationary babble noise at SNRs from 5 dB to 20 dB, using the cluster-based reconstruction missing feature method. The proposed method achieves better recognition accuracy than conventional spectral subtraction mask estimation methods.
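For context, the conventional spectral subtraction mask that this abstract uses as its baseline can be sketched roughly as below. This is not the neural-network method the paper proposes. It assumes the noise spectrum is estimated from the first few frames and that an element is reliable when its estimated local SNR exceeds a threshold; the function name, frame count, and threshold are hypothetical choices.

```python
import numpy as np

def spectral_subtraction_mask(power_spec, n_noise_frames=10, snr_threshold_db=0.0):
    """Binary reliability mask for a (frames, bins) power spectrogram.

    True marks elements treated as reliable, False marks elements treated as missing.
    """
    # Assumption: the leading frames contain noise only.
    noise = power_spec[:n_noise_frames].mean(axis=0)
    # Subtract the noise estimate and floor the result to keep the logarithm finite.
    clean_estimate = np.maximum(power_spec - noise, 1e-10)
    snr_db = 10.0 * np.log10(clean_estimate / np.maximum(noise, 1e-10))
    return snr_db > snr_threshold_db

# Toy spectrogram: flat noise floor with one strong time-frequency element.
spec = np.ones((20, 4))
spec[15, 2] = 100.0
mask = spectral_subtraction_mask(spec)
```

The dependence on an explicit noise estimate is visible in the first line of the function, which is the weakness the abstract attributes to this family of techniques.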