Abstract
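To train the language model of a continuous speech recognizer, a training corpus must be built for the recognizer's application domain. There are two main sources of such a corpus: one is obtained by transcribing recordings made in the actual application scenario and is called the real scene corpus; the other is generated from finite state network (FSN) grammar rules and is called the FSN corpus. This paper focuses on how to balance these two corpora. It proposes a method that compares the probabilities of the keywords shared by the real scene corpus and the FSN corpus and, on that basis, extends the FSN corpus with a certain multiple of part of the real scene corpus to obtain the final language model training corpus. A language model trained on the resulting corpus raised the keyword detection rate of the continuous speech recognizer from 55% to 77% and reduced the syllable error rate from 39% to 30%.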
The language model is a very important component of a continuous speech recognition system; however, a training corpus for the language model cannot be easily retrieved from the various corpus resources, such as the real scene corpus and the FSN (finite state network) corpus. This paper describes an effective method for retrieving a training corpus from the real scene corpus and the FSN corpus by comparing the probabilities of keywords in both corpora. This method balances the two corpora to interpret content with ...
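The balancing idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the "part" of the real scene corpus is selected or how the multiple is chosen, so everything concrete here is an assumption made for illustration: sentences are lists of tokens, the selected part is taken to be the real scene sentences that contain a shared keyword, the multiple is a caller-supplied integer, and the function names `keyword_probs`, `shared_keywords`, and `extend_fsn` are hypothetical.

```python
from collections import Counter


def keyword_probs(corpus, keywords):
    """Relative frequency of each keyword among all tokens of a corpus.

    `corpus` is a list of sentences, each a list of tokens (assumed format).
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keywords}


def shared_keywords(fsn, real, keywords):
    """Keywords that occur in both the FSN corpus and the real scene corpus."""
    fsn_vocab = {tok for sent in fsn for tok in sent}
    real_vocab = {tok for sent in real for tok in sent}
    return sorted(k for k in keywords if k in fsn_vocab and k in real_vocab)


def extend_fsn(fsn, real, keywords, multiple):
    """Append `multiple` copies of part of the real scene corpus to the FSN corpus.

    Assumption: the "part" is every real scene sentence that contains at
    least one shared keyword. The paper's actual selection rule may differ.
    """
    shared = set(shared_keywords(fsn, real, keywords))
    subset = [sent for sent in real if any(tok in shared for tok in sent)]
    return fsn + subset * multiple


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    fsn = [["call", "zhang", "san"], ["call", "li", "si"]]
    real = [["please", "call", "zhang", "san"], ["hello"],
            ["uh", "call", "li", "si", "now"]]
    merged = extend_fsn(fsn, real, {"call"}, multiple=2)
    # Compare the keyword probability before and after balancing.
    print(keyword_probs(fsn, ["call"]), keyword_probs(real, ["call"]),
          keyword_probs(merged, ["call"]))
```

In practice one would try several values of `multiple` and compare the shared keyword probabilities of the merged corpus against those of the real scene corpus; the criterion the authors actually use for that comparison is not given in the abstract.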
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals
2008, Issue S1, pp. 730-734 (5 pages)
Journal of Tsinghua University (Science and Technology)