
An n-gram Statistics Algorithm for Arbitrary n on Large-Scale Chinese Corpora, and a Knowledge Acquisition Method (cited by: 4)

Algorithm of n gram Statistics for Arbitrary n and Knowledge Acquisition Based on Statistics
Abstract: This paper proposes and implements an n-gram statistics algorithm for arbitrary n at the character and word level on large-scale Chinese corpora. In a single run, the algorithm counts all character-level and word-level n-grams for every n up to a chosen bound (n = 256 in this paper). It reduces the exponential space cost of conventional n-gram counting to linear, independent of the n-gram order being counted. Based on these n-gram statistics, the paper also computes the information entropy of Chinese and studies knowledge acquisition at the character and word level. The algorithm and its results have been integrated with an MT system.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 1997, No. 1, pp. 28-35 (8 pages)
Keywords: n-gram; statistics; information entropy; knowledge acquisition; Chinese corpus
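
The abstract does not spell out how the linear space bound is reached. One well-known way to get that property is to sort suffix start positions, in the style of Nagao and Mori (1994); the sketch below assumes that approach and is not taken from the paper. The names `sort_suffixes`, `count_ngrams`, and `entropy` are hypothetical, and the entropy helper is the plain Shannon formula over observed counts, which may differ from the estimator the authors used.

```python
import math
from functools import cmp_to_key


def sort_suffixes(text, max_n):
    """Return all suffix start positions, sorted on their first max_n symbols.

    Only one integer per position is stored, so the index grows linearly
    with the text and does not depend on max_n.
    """
    def compare(i, j):
        # Compare two suffixes on at most max_n symbols (transient slices only).
        a, b = text[i:i + max_n], text[j:j + max_n]
        return (a > b) - (a < b)

    return sorted(range(len(text)), key=cmp_to_key(compare))


def count_ngrams(text, order, n):
    """Yield (ngram, count) for one n (n <= max_n) in a single scan.

    Equal n-grams are adjacent in the sorted order, so a run-length
    scan is enough; no table of all n-grams is kept in memory.
    """
    current, count = None, 0
    for start in order:
        if start + n > len(text):
            continue  # this suffix is shorter than n, so it holds no n-gram
        gram = text[start:start + n]
        if gram == current:
            count += 1
        else:
            if current is not None:
                yield current, count
            current, count = gram, 1
    if current is not None:
        yield current, count


def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


# Character-level toy example; pass a tuple of words for word-level counts.
text = "abababc"
order = sort_suffixes(text, max_n=4)
bigrams = dict(count_ngrams(text, order, 2))
trigrams = dict(count_ngrams(text, order, 3))
unigram_entropy = entropy([c for _, c in count_ngrams(text, order, 1)])
```

The same sorted index serves every n up to `max_n`, which is why the counting cost in space does not grow with the n-gram order. A production version would replace the slice comparisons with an in-place comparison or a dedicated suffix-array construction.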

