
A Smoothing Technique for Statistical Language Model Based on Mutual Information (cited by: 8)
Abstract Data smoothing is mainly used to address the data sparsity problem that statistical language models face in practical applications. Existing smoothing techniques handle data sparsity effectively, but they do not analyze whether the frequency distribution of the events that have already been observed is reasonable. This paper proposes a smoothing technique based on mutual information for the bigram model. The basic idea is to discount or compensate the probability of each bigram in the model according to how high or low its mutual information is, with the principle of minimizing perplexity used to keep the resulting model reasonable. Experimental results show that the technique outperforms the commonly used Katz smoothing technique.
Source Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2005, No. 4, pp. 46-51 (6 pages)
Funding National Natural Science Foundation of China (60173060)
Keywords computer application; Chinese information processing; statistical language model; smoothing technique; mutual information; perplexity
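
The abstract gives the idea but not the formulas, so the following is only a minimal sketch of the two ingredients it names: adjusting bigram probabilities according to mutual information, and selecting the adjustment by minimizing perplexity. Everything specific here is an assumption made for illustration and is not taken from the paper: the scaling rule `2 ** (alpha * PMI)`, the parameter `alpha`, the add-k (Lidstone) fallback for unseen bigrams, the grid of candidate values, and the toy corpus. With `alpha = 0` the model reduces to plain Lidstone smoothing.

```python
import math
from collections import Counter


def train(tokens):
    """Collect unigram and bigram counts from a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def pmi(x, y, uni, bi):
    """Pointwise mutual information (in bits) of an observed bigram (x, y)."""
    n_uni = sum(uni.values())
    n_bi = sum(bi.values())
    p_xy = bi[(x, y)] / n_bi
    return math.log2(p_xy / ((uni[x] / n_uni) * (uni[y] / n_uni)))


def make_model(uni, bi, alpha, k=0.5):
    """Return p(y | x) built from MI-scaled counts plus add-k smoothing.

    Hypothetical scheme (not the paper's formula): each observed count
    c(x, y) is multiplied by 2 ** (alpha * PMI(x, y)), so alpha > 0
    compensates high-MI pairs and alpha < 0 discounts them.
    """
    vocab_size = len(uni)
    scaled = {
        (x, y): c * 2 ** (alpha * pmi(x, y, uni, bi))
        for (x, y), c in bi.items()
    }
    totals = Counter()
    for (x, _), weight in scaled.items():
        totals[x] += weight

    def prob(y, x):
        # Add-k smoothing keeps unseen bigrams at a nonzero probability.
        return (scaled.get((x, y), 0.0) + k) / (totals[x] + k * vocab_size)

    return prob


def perplexity(prob, tokens):
    """Perplexity of a bigram model on a token list: 2 ** (-mean log2 p)."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log2(prob(y, x)) for x, y in pairs)
    return 2 ** (-log_sum / len(pairs))


# Toy data; the held-out text uses only words seen in training.
train_tokens = "the cat sat on the mat the cat ate the fish".split()
heldout_tokens = "the cat sat on the fish".split()
uni, bi = train(train_tokens)

# Grid search for the alpha with the lowest held-out perplexity.
alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]
scores = {a: perplexity(make_model(uni, bi, a), heldout_tokens) for a in alphas}
best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

Because the scaled counts are renormalized per history word, `prob` remains a proper distribution over the training vocabulary for any `alpha`. A comparison against Katz smoothing, as reported in the abstract, would need a back-off implementation that this sketch does not include.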
  • Related literature

References: 5

Secondary references: 3

Co-citing literature: 24

Co-cited literature: 68

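  • 1 李亭骞, 曹渠江. Mechanism and application of Chinese character input methods on the Windows platform [J]. Computer Applications and Software, 2006, 23(1): 40-42. Cited by: 2
  • 2 顾平, 朱巧明, 李培峰, 钱培德. Research on intelligent numeric-code input technology for Chinese characters [J]. Journal of Chinese Information Processing, 2006, 20(4): 100-105. Cited by: 7
  • 3 张仰森, 曹元大, 俞士汶. Measuring language model complexity and estimating the entropy of Chinese [J]. Journal of Chinese Computer Systems, 2006, 27(10): 1931-1934. Cited by: 7
  • 4 杨琳, 张建平, 颜永红. A comparative study of smoothing algorithms for domain-specific Chinese language models [J]. Computer Engineering and Applications, 2006, 42(32): 14-16. Cited by: 5
  • 5 赵华, 赵铁军, 张姝, 王浩畅. Research on topic detection based on content analysis [J]. Journal of Harbin Institute of Technology, 2006, 38(10): 1740-1743. Cited by: 20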
  • 6 LIDSTONE G J. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities [J]. Transactions of the Faculty of Actuaries, 1920, 8: 182-192.
  • 7 KATZ S M. Estimation of probabilities from sparse data for the language model component of a speech recognizer [J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1987, 35(3): 400-401.
  • 8 GOODMAN J, CHEN S F. An empirical study of smoothing techniques for language modeling [J]. Computer Speech and Language, 1999, 13(4): 359-393.
  • 9 CHURCH K W, GALE W A. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams [J]. Computer Speech and Language, 1991, 5(1): 19-54.
  • 10 JELINEK F, MERCER R L. Interpolated estimation of Markov source parameters from sparse data [C]// Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam, 1980: 381-397.

Citing literature: 8

Secondary citing literature: 8
