Abstract
Automatic text-speech alignment is widely used in speech recognition, speech synthesis, content production, and other fields. It aligns speech with its reference text at the sentence, word, or phoneme level and yields the timing information that links the two. Most recent state-of-the-art alignment methods are based on automatic speech recognition (ASR). On the one hand, their accuracy is bounded by ASR quality: alignment precision drops markedly when the word error rate (WER) is high. On the other hand, they cannot effectively align long recordings with transcripts that only partially match the audio. This study proposes a text-speech alignment method based on anchors and prosodic information. Fragment annotation based on boundary anchor weighting divides the corpus into aligned and unaligned fragments. For the unaligned fragments, a dual-threshold endpoint detection method extracts prosodic information and detects sentence boundaries, which reduces the dependence of ASR-based alignment on recognition quality. Experimental results show that, compared with current advanced ASR-based text-speech alignment methods, the proposed method improves alignment accuracy by more than 45% even when the WER is 0.52, and by at least 3% when the degree of audio-text mismatch is 0.5.
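The dual-threshold endpoint detection mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classic short-time-energy form of the technique: a segment must contain at least one frame above a high threshold, and its edges are then extended outward while the energy stays above a low threshold. The abstract does not specify the features or thresholds actually used in the paper, so the function names, the non-overlapping framing, and the threshold values here are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are ignored.
    """
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dual_threshold_segments(energies, high, low):
    """Return speech segments as (start, end) frame indices, end exclusive.

    Assumed rule (classic double-threshold method, not taken from the paper):
    a frame above `high` seeds a segment, which is then grown in both
    directions while neighbouring frames stay above `low`.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    segments = []
    i, n = 0, len(energies)
    while i < n:
        if energies[i] <= high:
            i += 1
            continue
        # Seed found: extend backward over frames above the low threshold.
        start = i
        while start > 0 and energies[start - 1] > low:
            start -= 1
        # Extend forward over frames above the low threshold.
        end = i + 1
        while end < n and energies[end] > low:
            end += 1
        segments.append((start, end))
        i = end
    return segments


# Hypothetical per-frame energies: one clear burst and one weak blip.
energies = [0.0, 0.2, 0.9, 0.8, 0.3, 0.05, 0.0, 0.3, 0.0]
segments = dual_threshold_segments(energies, high=0.5, low=0.1)
```

In this example the weak blip at frame 7 exceeds the low threshold but never reaches the high one, so it is not reported as speech. The gaps between the returned segments are the candidate pauses where sentence boundaries could be placed.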
Authors
徐锴
陶冶
李辉
XU Kai; TAO Ye; LI Hui (School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China)
Source
《计算机系统应用》
2023, No. 4, pp. 300-307 (8 pages)
Computer Systems & Applications
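Funding
National Key Research and Development Program of China (2018YFB1702902)
Youth Innovation Science and Technology Support Program of Shandong Province Higher Education Institutions (2019KJN047)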
Keywords
text-speech alignment
prosodic information
anchor
automatic speech recognition (ASR)
endpoint detection