
Word Clustering Based on Similarity and Vari-gram Language Model (cited by: 7)
Abstract: Class-based statistical language models are an important way to address the data sparseness problem in statistical language modeling. Conventional statistical clustering methods follow a greedy principle and typically use the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong dependence of the result on the initial choice, and a tendency to converge to a local optimum, so a global optimum is not guaranteed. This paper defines a word similarity measure based on mutual information and, building on that similarity, proposes a bottom-up hierarchical clustering algorithm. Experiments show that the algorithm clearly improves on conventional greedy statistical clustering in both computational complexity and clustering quality. To improve predictive power, the paper also proposes a new method for generating a class-based variable-length (vari-gram) language model.
Author: Yuan Lichi (袁里驰)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2009, No. 5, pp. 912-915 (4 pages)
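Funding: Supported by the National Natural Science Foundation of China (60763001, 60663007) and the Central South University Postdoctoral Science Foundation (2007)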
Keywords: mutual information; word similarity; clustering algorithm; vari-gram language model
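
The abstract describes the approach only at a high level: a word similarity derived from mutual information, followed by bottom-up hierarchical clustering. The sketch below is one possible reading of that idea, not the paper's actual definitions. It assumes that each word is represented by the positive pointwise mutual information (PMI) it has with its right-hand neighbours, that similarity is the cosine between those vectors, and that clusters are merged by average linkage. The function names `pmi_vectors`, `cosine`, and `cluster` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(sentences):
    """Map each word to {right neighbour: positive PMI}, from adjacent-word bigrams.

    Assumption: PMI over adjacent bigrams stands in for the paper's
    mutual-information-based measure, which the abstract does not specify.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    vecs = defaultdict(dict)
    for (w, c), n in bigrams.items():
        p_joint = n / n_bi
        p_indep = (unigrams[w] / n_uni) * (unigrams[c] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vecs[w][c] = pmi
    return {w: vecs.get(w, {}) for w in unigrams}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster(vecs, n_classes):
    """Bottom-up clustering: repeatedly merge the most similar pair of clusters."""
    clusters = [[w] for w in sorted(vecs)]

    def link(a, b):
        # Average linkage (an assumption): mean pairwise word similarity.
        return sum(cosine(vecs[x], vecs[y]) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > n_classes:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)  # j > i, so index i is unaffected
    return clusters
```

Called as `cluster(pmi_vectors(sentences), n_classes)` on a tokenized corpus, this merges words that occur before the same neighbours. The naive pair search is recomputed at every merge and is meant only to illustrate the bottom-up procedure; it makes no attempt to reproduce the complexity improvements the abstract reports.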

