Abstract
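We study the ability of the wav2vec2 speech representation learning technique to recognize Mandarin phonemes. Neural networks are trained with two kinds of recognition units, phonemes alone and phonemes plus tones. The trained models are then used to recognize spoken Mandarin sentences, and we analyze the types of recognition errors and explore their causes. In the experiments we distinguish the apical dental vowel [ɿ], the apical post-alveolar vowel [ʅ], and the dorsal vowel [i], and convert Hanyu Pinyin into basic units in which one symbol corresponds to one sound. The pre-trained speech representation model wav2vec2.0-base is then trained on the training set of the Mandarin speech dataset THCHS-30. With phonemes as the recognition unit and tones ignored, the syllable error rate on the THCHS-30 test set is only 6.62%. The basic form of recognition error is a failure to distinguish the place of articulation of consonants: nasal coda errors account for 43.86% of all misrecognized syllables, confusion between the alveolar and velar nasal codas accounts for 26.78%, and confusion between apical dental and retroflex initials accounts for 9.76%. When the first (yinping), second (yangping), third (shangsheng), fourth (qusheng), and neutral tones are distinguished, the syllable error rate is 14.26%, and tone recognition errors are concentrated mainly in syllables that undergo tone sandhi. The experiments show that wav2vec2 speech representation has a strong phoneme recognition ability. Beyond speech recognition, speech representation technology can also be used in speech research and language surveys.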
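The transcription step mentioned above, in which the three vowels written as "i" in Pinyin are given separate symbols, can be sketched as follows. This is a minimal illustration and not the authors' actual mapping table: the function names `split_initial` and `transcribe_i` are hypothetical, the input is assumed to be a toneless lowercase Pinyin syllable, and the rule relies only on the standard fact that "i" is [ɿ] after z, c, s and [ʅ] after zh, ch, sh, r.

```python
# Hypothetical sketch: split a toneless Pinyin syllable into initial and final,
# and give a bare final "i" the symbol for the vowel it actually represents.
TWO_LETTER_INITIALS = ("zh", "ch", "sh")
ONE_LETTER_INITIALS = set("bpmfdtnlgkhjqxrzcs")
APICAL_DENTAL = {"z", "c", "s"}            # followed by [ɿ]
RETROFLEX = {"zh", "ch", "sh", "r"}        # followed by [ʅ]


def split_initial(syllable):
    """Return (initial, final); the initial is "" for zero-initial syllables."""
    # Check two-letter initials first so that "zh" is not read as "z" + "h...".
    for initial in TWO_LETTER_INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    if syllable[:1] in ONE_LETTER_INITIALS:
        return syllable[0], syllable[1:]
    return "", syllable


def transcribe_i(syllable):
    """Replace a bare final "i" with ɿ or ʅ where the initial requires it."""
    initial, final = split_initial(syllable)
    if final == "i":
        if initial in APICAL_DENTAL:
            final = "ɿ"
        elif initial in RETROFLEX:
            final = "ʅ"
    return initial, final


for s in ["si", "shi", "ji", "xin"]:
    print(s, transcribe_i(s))
```

Only the bare final "i" is rewritten, so finals such as "in" or "ing" keep the dorsal vowel. Syllables spelled with y or w are left untouched here and would need their own rules in a full transcription scheme.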
This paper introduces the architectures of the speech representation learning neural networks wav2vec and wav2vec2, and the development of the unsupervised speech recognition systems wav2vec-U and wav2vec-U 2.0. We investigate the performance of wav2vec2 in recognizing Mandarin speech sounds. The wav2vec2.0-base model is retrained to distinguish the phonemes and tones of Mandarin, the retrained network is used to recognize Mandarin speech, the word error rate (WER) is computed, and the probable causes of common errors are discussed. We separate the apical dental vowel [ɿ], the apical post-alveolar vowel [ʅ], and the dorsal vowel [i] in Mandarin, and transcribe Chinese Pinyin into a system in which one symbol corresponds to one sound. We distinguish five syllabic tonal patterns. The pre-trained speech representation model wav2vec2.0-base is retrained on the training set of the Mandarin Chinese speech dataset THCHS-30. The performance of wav2vec2-based Mandarin speech recognition is analyzed with a deep convolutional neural network speech recognition system as a reference. With phones as the recognition units and tones ignored, the WER on the THCHS-30 test set is 6.62%. Phone recognition errors are concentrated mainly on nasal codas: 43.86% of all incorrectly recognized syllables contain a nasal coda error, and confusion between the alveolar nasal and the velar nasal accounts for 26.78% of all incorrect syllables. Furthermore, confusion between blade-alveolars and retroflexes accounts for 9.76% of all incorrect syllables. What these errors typically reflect is that the wav2vec2 model is slightly weaker at distinguishing consonants by place of articulation. We attribute the confusion between the nasal codas [n] and [ŋ] to the fact that wav2vec2 models the time-domain signal directly from speech waveform samples. The two sounds differ in the frequency domain rather than in the time domain: their transition cues differ, the strengths of their resonances and anti-resonances differ, and the zero frequency of [n] is lower than that of [ŋ]. In principle, modeling the frequency-domain signal should therefore work better. When the five tones are taken into account, the WER is 14.26%, and tone recognition errors are mainly limited to syllables with tone sandhi. The experiment shows that wav2vec2 speech representation technology effectively models speech sound features and has strong phonemic recognition ability. Language comparison based on wav2vec2 is introduced, and possible future application areas are discussed. In addition to its application in speech recognition, speech representation technology can also be used for speech research and language investigation. In speech research, it can be used to measure speech sounds: the speech features represented by vectors can serve as indicators for comparing speech sounds across languages. In language investigation, it can be used for automatic speech annotation. The experiment finds that, although there are some phoneme recognition errors, the phoneme start and end times marked by CTC are generally correct. Current speech representation technology needs to address three issues: (1) improving recognition accuracy; (2) reducing the number of parameters; (3) categorizing phonemes automatically.
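The error rates quoted above are ratios of recognition errors to reference units. The record does not say how the authors scored their output, so the following is only a sketch under the common assumption that errors are counted as the Levenshtein (edit) distance between reference and recognized syllable sequences. The names `edit_distance` and `syllable_error_rate` and the sample syllables are hypothetical.

```python
# Hypothetical sketch: error rate = (substitutions + deletions + insertions)
# divided by the number of reference syllables, summed over all utterances.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of syllables."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference syllables
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def syllable_error_rate(refs, hyps):
    """Corpus-level error rate over paired reference/hypothesis utterances."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference contains no syllables")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total


# Made-up example: one nasal coda substitution (jin -> jing) and one deletion.
refs = [["jin", "tian", "tian", "qi"]]
hyps = [["jing", "tian", "qi"]]
print(syllable_error_rate(refs, hyps))  # 2 errors / 4 syllables = 0.5
```

Summing errors over the whole test set before dividing, rather than averaging per-utterance rates, keeps long and short utterances weighted by their length.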
Authors
张金光
孔江平
ZHANG Jinguang;KONG Jiangping
Source
《中国语音学报》
2023, No. 2, pp. 159-166 (8 pages)
Chinese Journal of Phonetics
Funding
Supported by the Hebei Provincial Social Science Fund project "Constraints on Syllable Gaps in Mandarin" (普通话音节空档的约束条件), Grant No. HB20YY010
Keywords
Speech Representation
Speech Recognition
Mandarin Speech Sounds