Abstract
Data smoothing is used mainly to address the data sparseness problem that statistical language models face in practical applications. Existing smoothing techniques handle data sparseness effectively, but they do not analyze whether the frequency distribution of the events already observed is reasonable. For the bigram model, this paper proposes a smoothing technique based on mutual information. The basic idea is to discount or compensate the probability of each bigram in the model according to how high or low its mutual information is, with the principle of minimizing perplexity used to reflect the reasonableness of the model. Experimental results show that the technique outperforms the commonly used Katz smoothing technique.
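The abstract names two quantities, the mutual information of a bigram and the perplexity of the model, without giving formulas. The following is a minimal sketch of how each is commonly computed, assuming maximum-likelihood estimates from raw counts and the pointwise form of mutual information. The function names are hypothetical, and the sketch does not reproduce the paper's discount or compensation rule, which the abstract does not specify.

```python
import math
from collections import Counter


def bigram_stats(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pmi(w1, w2, unigrams, bigrams):
    """Pointwise mutual information log2(P(w1,w2) / (P(w1) * P(w2))).

    Assumes the bigram (w1, w2) was observed at least once.
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x = unigrams[w1] / n_uni
    p_y = unigrams[w2] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def perplexity(tokens, cond_prob):
    """Perplexity of a bigram model on a token list.

    cond_prob(w1, w2) must return P(w2 | w1) > 0 for every adjacent pair.
    """
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log2(cond_prob(a, b)) for a, b in pairs)
    return 2 ** (-log_sum / len(pairs))
```

Under these assumptions, a bigram with high PMI co-occurs more often than its unigram frequencies would predict, and a smoothing scheme can be compared against another (for example Katz smoothing) by evaluating `perplexity` on held-out text, where a lower value is better.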
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2005, Issue 4, pp. 46-51 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60173060)
Keywords
computer application
Chinese information processing
statistical language model
smoothing technique
mutual information
perplexity