Abstract
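An algorithm based on quicksort and merge sort is presented for estimating high-order Markov statistical language models of Chinese over a large symbol set, and its time and space complexity are analyzed. Based on this algorithm, a probability statistics system for Chinese characters (words) was designed and implemented. By counting over a Chinese corpus of more than ten million characters, unigram, bigram and trigram Markov models of Chinese characters (words) were built, and the statistical results were analyzed.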
This paper presents an algorithm that combines quicksort and merge sort to construct Markov language models of Chinese characters and words over a large symbol set. The time and space complexity of the algorithm are discussed. Based on the algorithm, a system for computing Chinese character/word probability distributions is introduced. Unigram, bigram and trigram Chinese language models are built from more than twenty million Chinese characters, and the results are analyzed. The experimental results show that statistical language models perform well in approximating the short-range constraint relationships of the Chinese language.
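The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it names: sort the n-grams of each corpus chunk in memory with quicksort, merge the sorted runs, and count adjacent duplicates. The function names (`ngrams`, `quicksort`, `count_ngrams`) and the chunking scheme are assumptions made for illustration and are not taken from the paper. The sketch works on character n-grams, keeps all runs in memory rather than on disk, and ignores n-grams that span chunk boundaries.

```python
import heapq
from itertools import groupby


def ngrams(text, n):
    """Return all character n-grams of `text`, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def quicksort(items):
    """Plain three-way quicksort; returns a new sorted list."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)


def count_ngrams(chunks, n):
    """Count n-grams over corpus chunks by sorting runs and merging them.

    Hypothetical helper: each chunk is sorted on its own (one sorted run
    per chunk), the runs are combined with a k-way merge, and equal
    neighbours in the merged stream are counted.
    """
    runs = [quicksort(ngrams(chunk, n)) for chunk in chunks]
    merged = heapq.merge(*runs)  # lazy k-way merge of the sorted runs
    return {gram: sum(1 for _ in group) for gram, group in groupby(merged)}


if __name__ == "__main__":
    # Toy corpus split into two chunks; prints the bigram counts.
    print(count_ngrams(["abab", "ba"], 2))
```

Because identical n-grams end up adjacent after the merge, counting needs only a single pass and no hash table over the full symbol set, which is the usual motivation for sort-based counting when the set of distinct n-grams is large.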
Source
《哈尔滨工业大学学报》
EI
CAS
CSCD
PKU Core
1997, No. 5, pp. 23-27 (5 pages)
Journal of Harbin Institute of Technology
Funding
National 863 High-Tech Program
Fok Ying Tung Foundation