An Improved Cache-based Adaptive Chinese Language Model

Cited by: 1
Abstract: n-gram language models have proved powerful and robust in a variety of tasks, but the Markov assumption limits their dependencies to a very short local context. Cache-based adaptive language models improve adaptation across domains to some extent, yet their underlying hypothesis is too simple: a word that appeared earlier in a document is assumed likely to recur later in it. From observing a number of texts, we argue that authors not only reuse earlier words but also, to avoid monotonous wording, use synonyms or near-synonyms of them. Moreover, a document always develops around some topic, so many of its words are strongly related in meaning. We therefore extend the cache-based language model with a Chinese semantic-class thesaurus, so that the cache also admits words that are semantically similar or related to the words it already holds. Experiments show that the extension substantially improves on the original model: perplexity falls by about 40.1% relative to an n-gram language model, which effectively strengthens the model's adaptivity.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the PKU Core Journals list, 2005, No. 1, pp. 8-13 (6 pages)
Funding: National Natural Science Foundation of China (60203007); major project of the National 863 Program under the Tenth Five-Year Plan (2001AA114040)
Keywords: artificial intelligence; natural language processing; language model; adaptive model; Chinese thesaurus (Tongyici Cilin, 同义词词林); perplexity
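
The idea summarized in the abstract can be illustrated with a small sketch. This record does not give the paper's equations, so everything below is an assumption made for illustration: a static unigram table stands in for the n-gram model, the cache is a fixed-size window mixed in by linear interpolation, and a toy `SEMANTIC_CLASSES` list stands in for the thesaurus. The names `related_words`, `ExpandedCacheLM`, `lam`, and `related_weight` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, deque

# Hypothetical toy semantic classes standing in for a thesaurus such as
# Tongyici Cilin; the real class inventory is not given in this record.
SEMANTIC_CLASSES = [
    {"银行", "储蓄所", "信贷"},   # bank, savings office, credit
    {"医生", "大夫", "护士"},     # doctor, physician, nurse
]


def related_words(word):
    """Return the words that share a semantic class with `word`, excluding it."""
    related = set()
    for cls in SEMANTIC_CLASSES:
        if word in cls:
            related |= cls - {word}
    return related


class ExpandedCacheLM:
    """Static model interpolated with a cache that also credits related words."""

    def __init__(self, base_probs, cache_size=200, lam=0.2, related_weight=0.5):
        self.base_probs = base_probs          # unigram stand-in for the n-gram model
        self.lam = lam                        # cache interpolation weight (assumed)
        self.related_weight = related_weight  # count credited to each related word (assumed)
        self.cache = deque(maxlen=cache_size) # sliding window over recent words

    def _cache_counts(self):
        # Each cached word counts once; each of its related words gets partial credit.
        counts = Counter()
        for w in self.cache:
            counts[w] += 1.0
            for r in related_words(w):
                counts[r] += self.related_weight
        return counts

    def prob(self, word):
        # Crude floor for unseen words; a real model would use proper smoothing.
        base = self.base_probs.get(word, 1e-6)
        counts = self._cache_counts()
        total = sum(counts.values())
        if total == 0:
            return base  # empty cache: fall back to the static model
        return self.lam * counts[word] / total + (1 - self.lam) * base

    def observe(self, word):
        self.cache.append(word)


def perplexity(model, words):
    """Perplexity of `words`, updating the cache after each prediction."""
    log_sum = 0.0
    for w in words:
        log_sum += math.log(model.prob(w))
        model.observe(w)
    return math.exp(-log_sum / len(words))


# Toy comparison: the same static table with and without the expanded cache.
base = {"银行": 0.01, "信贷": 0.001, "医生": 0.01, "的": 0.05}
text = ["银行", "的", "信贷", "的", "银行"]
print(perplexity(ExpandedCacheLM(base, lam=0.0), text))
print(perplexity(ExpandedCacheLM(base, lam=0.2), text))
```

Setting `lam=0.0` disables the cache, so the first figure is the static model's perplexity on the toy text. The toy probabilities are not normalized and the numbers say nothing about the paper's reported results; the sketch only shows how seeing one word can raise the probability of a semantically related word that has not yet appeared.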