An Improved Cache-based Adaptive Chinese Language Model (cited by: 1)

Abstract: N-gram language models have proved powerful and robust across many tasks, but the Markov assumption limits the dependencies they capture to a very short local context. Cache-based adaptive language models improve adaptability across domains to some extent, yet their underlying assumption is too simple: a word that appeared earlier in a document tends to reappear later. From observing a number of texts, we argue that authors not only reuse earlier words but also, to avoid monotonous wording, use synonyms or near-synonyms of those words. Moreover, a document always revolves around some topic, so many of its words are strongly related semantically. We extend the cache-based language model by using a Chinese concept lexicon to bring words that are semantically close or related to the cached words into the cache as well. Experiments show that this extension substantially improves the performance of the original model: perplexity falls by 40.1% relative to the n-gram language model, which effectively strengthens the model's adaptability.
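The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the class name `ExtendedCacheLM`, the interpolation weight `cache_weight`, the discount `related_weight`, the toy vocabulary, and the toy thesaurus are all assumptions made for illustration. A static unigram table also stands in for the n-gram model to keep the example short.

```python
import math
from collections import Counter, deque


class ExtendedCacheLM:
    """Static model interpolated with a thesaurus-expanded cache (illustrative only)."""

    def __init__(self, base_probs, thesaurus, cache_size=200,
                 cache_weight=0.2, related_weight=0.5, floor=1e-6):
        self.base_probs = base_probs          # word -> static probability (unigram stand-in)
        self.thesaurus = thesaurus            # word -> set of semantically related words
        self.cache = deque(maxlen=cache_size) # most recent words of the document
        self.cache_weight = cache_weight      # assumed interpolation weight
        self.related_weight = related_weight  # assumed discount for related words
        self.floor = floor                    # crude probability for unknown words

    def _cache_prob(self, word):
        # Each cached word counts fully; each of its related words counts
        # with the discounted weight, so related words also enter the cache.
        counts = Counter()
        for seen in self.cache:
            counts[seen] += 1.0
            for related in self.thesaurus.get(seen, ()):
                counts[related] += self.related_weight
        total = sum(counts.values())
        return counts[word] / total if total else 0.0

    def prob(self, word):
        base = self.base_probs.get(word, self.floor)
        if not self.cache:
            return base
        lam = self.cache_weight
        return (1 - lam) * base + lam * self._cache_prob(word)

    def observe(self, word):
        self.cache.append(word)


def perplexity(model, words):
    """Score each word, then add it to the cache, and return the perplexity."""
    log_sum = 0.0
    for word in words:
        log_sum += math.log(model.prob(word))
        model.observe(word)
    return math.exp(-log_sum / len(words))


# Toy data (hypothetical): a uniform base model and one synonym pair.
base = {"doctor": 0.25, "physician": 0.25, "car": 0.25, "weather": 0.25}
thesaurus = {"doctor": {"physician"}, "physician": {"doctor"}}

model = ExtendedCacheLM(base, thesaurus)
model.observe("doctor")
p_related = model.prob("physician")   # boosted through the thesaurus
p_unrelated = model.prob("car")       # no boost from the cache

text = ["doctor", "physician", "doctor"]
ppl_static = perplexity(ExtendedCacheLM(base, thesaurus, cache_weight=0.0), text)
ppl_extended = perplexity(ExtendedCacheLM(base, thesaurus), text)
```

In this toy setup a synonym of a cached word receives more probability than an unrelated word, and the extended model scores a text that alternates between synonyms with lower perplexity than the static model alone. The numbers say nothing about the 40.1% reduction reported in the abstract, which depends on the authors' data, lexicon, and n-gram baseline.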

Authors: Zhang Junlin (张俊林), Sun Le (孙乐), Sun Yufang (孙玉芳)

Affiliation: System Software and Chinese Information Center, Institute of Software, Chinese Academy of Sciences

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2005, No. 1, pp. 8-13 (6 pages)

Funding: National Natural Science Foundation of China (60203007); National "Tenth Five-Year Plan" 863 Major Project (2001AA114040)

Keywords: artificial intelligence; natural language processing; language model; adaptive model; Chinese thesaurus (Tongyici Cilin, 同义词词林); perplexity

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]





