Abstract
This paper proposes and implements an algorithm for computing character-level and word-level n-gram statistics for arbitrary n over large-scale Chinese corpora. In a single run, the algorithm counts all character-level and word-level n-grams up to any chosen n (n = 256 in this paper). It reduces the exponential space cost of conventional n-gram counting to a linear one that is independent of the n-gram order. Based on these n-gram statistics, the paper also computes the information entropy of Chinese and studies knowledge acquisition at the character and word levels. The algorithm and its results have been integrated into an MT system.
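The abstract does not say how the linear space bound is obtained. As an illustration only, the sketch below uses one well-known technique with that property, sorting an index of suffix start positions; this is an assumption and may differ from the paper's actual algorithm. The names `suffix_index`, `ngram_counts`, and `entropy_bits` are hypothetical. The index holds one integer per corpus position regardless of n, and once it is sorted, the counts for every n up to `max_n` come from a simple scan.

```python
from functools import cmp_to_key
from itertools import groupby
from math import log2


def suffix_index(seq, max_n):
    """Return start positions sorted by the first max_n symbols of each suffix.

    seq is a str (character level) or a tuple of words (word level).
    The index stores one integer per position, so its size does not grow with max_n.
    """
    def compare(a, b):
        x, y = seq[a:a + max_n], seq[b:b + max_n]
        return (x > y) - (x < y)

    return sorted(range(len(seq)), key=cmp_to_key(compare))


def ngram_counts(seq, starts, n):
    """Count all n-grams (n <= max_n used for the index) in one scan.

    In sorted order, suffixes sharing the same first n symbols are adjacent,
    so each run of equal n-grams gives that n-gram's frequency.
    """
    grams = (seq[i:i + n] for i in starts if i + n <= len(seq))
    return {gram: sum(1 for _ in run) for gram, run in groupby(grams)}


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    text = "abcabcab"
    starts = suffix_index(text, max_n=3)   # built once, reused for every n
    for n in (1, 2, 3):
        counts = ngram_counts(text, starts, n)
        print(n, counts, round(entropy_bits(counts), 3))
```

The same index serves both levels: pass a string for character-level counts, or a tuple of segmented words for word-level counts (the keys are then tuples of words).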
Source
《情报学报》
CSSCI
PKU Core (Peking University core journal list)
1997, Issue 1, pp. 28-35 (8 pages)
Journal of the China Society for Scientific and Technical Information