Abstract
A back-off smoothing algorithm based on mutual information (MI Back-off) is proposed for bigram models. The algorithm analyzes the collocation relations between words from the perspective of mutual information, discounts the probability of each bigram in the model to a different degree according to its mutual information, and uses the lower-order model to compensate for zero-probability (unseen) events. The model parameters are estimated by minimizing perplexity, which also shows that the new algorithm is well founded. On test sets from different domains, the proposed algorithm reduces model perplexity by more than 20% in every case compared with the traditional Katz smoothing algorithm.
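The abstract does not give the discount formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical discount `d = 1 - lam / (1 + exp(PMI))`, which shrinks low-mutual-information bigrams more than high-mutual-information ones, and it redistributes the freed probability mass over unseen successors in proportion to a maximum-likelihood unigram model. The names `train`, `prob`, `perplexity`, and `lam` are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(tokens, lam=0.5):
    """Fit a bigram model with an MI-dependent discount (assumed form, see lead-in)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    p_uni = {w: c / n_uni for w, c in uni.items()}

    # Number of times each word occurs as a history (left element of a bigram).
    ctx = Counter()
    for (u, _), c in bi.items():
        ctx[u] += c

    seen = defaultdict(dict)
    for (u, w), c in bi.items():
        # Pointwise mutual information of the pair (u, w).
        pmi = math.log((c / n_bi) / (p_uni[u] * p_uni[w]))
        # Hypothetical discount in (1 - lam, 1): closer to 1 when PMI is high.
        d = 1.0 - lam / (1.0 + math.exp(pmi))
        seen[u][w] = d * c / ctx[u]

    # Back-off weight: mass removed by discounting, spread over unseen successors.
    alpha = {}
    for u, probs in list(seen.items()):
        left = 1.0 - sum(probs.values())
        unseen_mass = 1.0 - sum(p_uni[w] for w in probs)
        if unseen_mass <= 1e-12:
            # Every word already follows u: nothing to back off to, so renormalize.
            total = sum(probs.values())
            seen[u] = {w: p / total for w, p in probs.items()}
            alpha[u] = 0.0
        else:
            alpha[u] = left / unseen_mass
    return p_uni, seen, alpha


def prob(model, u, w):
    """P(w | u): discounted estimate if the bigram was seen, else back off to the unigram."""
    p_uni, seen, alpha = model
    if u not in seen:  # history never observed: use the unigram model directly
        return p_uni.get(w, 0.0)
    if w in seen[u]:
        return seen[u][w]
    return alpha[u] * p_uni.get(w, 0.0)


def perplexity(model, tokens):
    """Bigram perplexity; assumes every token is in the training vocabulary."""
    logp = sum(math.log(prob(model, u, w)) for u, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))
```

Under these assumptions, `lam` could be chosen by evaluating `perplexity` on held-out text for several candidate values and keeping the one with the lowest result, which mirrors the perplexity-minimization principle mentioned in the abstract.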
Source
《应用科技》
CAS
2009, No. 4, pp. 28-31, 35 (5 pages in total)
Applied Science and Technology
Keywords
Chinese information processing
statistical language model
smoothing algorithm
mutual information
perplexity