Abstract
The purpose of language model (LM) adaptation is to reduce the linguistic mismatch between the model and the recognition task, including lexicon mismatch, topic and style mismatch, and mismatch in the n-gram probability distributions. This paper proposes a new non-iterative new-word extraction method for Chinese and a novel open-vocabulary Chinese LM. Building on these techniques, it presents a unified LM adaptation framework for broadcast speech recognition that combines the non-iterative new-word extraction method, the open-vocabulary Chinese LM, a perplexity (PPL) based background corpus selection method, and an n-gram distribution adaptation module. The effect of LM adaptation on the recognition accuracy of named-entity words is also analyzed. Experiments show about a 10% relative reduction in character error rate and a 4% absolute increase in named-entity recognition accuracy.
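The perplexity-based background corpus selection mentioned in the abstract can be sketched roughly as follows: an in-domain n-gram model scores each background sentence, and only sentences whose perplexity falls below a threshold are kept for adaptation. This is a minimal illustration under assumed details, not the paper's exact procedure; the bigram model, smoothing, and all names (train_bigram, sentence_ppl, select_corpus, ppl_threshold) are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram/bigram counts from tokenized in-domain sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_ppl(sent, unigrams, bigrams):
    """Perplexity of one sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + sent + ["</s>"]
    vocab = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

def select_corpus(background, unigrams, bigrams, ppl_threshold=500.0):
    """Keep background sentences whose in-domain perplexity is below the threshold."""
    return [s for s in background
            if sentence_ppl(s, unigrams, bigrams) < ppl_threshold]

# Usage sketch: in_domain and background are lists of word-segmented sentences.
# unigrams, bigrams = train_bigram(in_domain)
# selected = select_corpus(background, unigrams, bigrams)
```

The selected subset would then be used to retrain or interpolate the background LM; the threshold value here is purely illustrative.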
Source
Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journals (北大核心)
2007, No. 4, pp. 73-79 (7 pages)
Funding
National 863 Program of China (2006AA010103)
Keywords
computer application
Chinese information processing
language model adaptation
new word extraction
open-vocabulary LM