Abstract
This paper fits the Zipf distribution curve by maximum likelihood estimation. The method works on the frequency-spectrum form of Zipf's law, chosen for mathematical convenience, and uses the Kolmogorov-Smirnov (KS) test to find the best-fitting straight line in log-log coordinates. Compared with the traditional least-squares method, maximum likelihood estimation gives more accurate fits. To validate the method, experiments were run on three sets of Chinese and English corpora. The results show that English words conform to Zipf's law fairly well, whereas Chinese words do not fit it well.
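The fitting idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several choices are assumptions made here for illustration: the frequency spectrum is modelled as a discrete power law p(k) = k^(-beta) / zeta(beta) starting at k = 1, the likelihood is maximised by a golden-section search, and the KS statistic is only reported as a fit diagnostic (the abstract does not say how the KS test selects the best line). All function names are hypothetical.

```python
import math
from collections import Counter


def zeta(beta, terms=2000):
    """Approximate the Riemann zeta function for beta > 1.

    Partial sum over the first `terms` integers plus an integral
    (midpoint) estimate of the remaining tail.
    """
    head = sum(k ** -beta for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1.0 - beta) / (beta - 1.0)
    return head + tail


def fit_exponent_mle(freqs, lo=1.05, hi=5.0, iters=60):
    """Estimate beta by maximum likelihood (assumed model, k_min = 1).

    `freqs` holds one occurrence count (>= 1) per word type. The negative
    log-likelihood beta * sum(ln k) + n * ln zeta(beta) is convex in beta,
    so a golden-section search on [lo, hi] finds its minimum.
    """
    n = len(freqs)
    sum_log = sum(math.log(k) for k in freqs)

    def neg_log_lik(beta):
        return beta * sum_log + n * math.log(zeta(beta))

    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if neg_log_lik(c) < neg_log_lik(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


def ks_distance(freqs, beta):
    """Largest gap between the empirical and the model CDF over k."""
    n = len(freqs)
    spectrum = Counter(freqs)
    z = zeta(beta)
    empirical = model = worst = 0.0
    for k in range(1, max(freqs) + 1):
        empirical += spectrum.get(k, 0) / n
        model += k ** -beta / z
        worst = max(worst, abs(empirical - model))
    return worst


def fit_exponent_least_squares(freqs):
    """Baseline: ordinary least-squares slope of ln n(k) against ln k."""
    spectrum = Counter(freqs)
    xs = [math.log(k) for k in spectrum]
    ys = [math.log(n_k) for n_k in spectrum.values()]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic spectrum with n(k) roughly proportional to k^-2 (illustrative data only).
    demo = [k for k in range(1, 101) for _ in range(round(10000 * k ** -2.0))]
    beta_hat = fit_exponent_mle(demo)
    print("MLE exponent:", beta_hat)
    print("KS distance:", ks_distance(demo, beta_hat))
    print("Least-squares exponent:", fit_exponent_least_squares(demo))
```

On a real corpus, `freqs` would be the list of per-word occurrence counts; the lower cutoff, search interval, and any KS-based selection rule would need to follow the paper itself.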
Source
《情报理论与实践》
CSSCI
PKU Core Journals (北大核心)
2012, No. 11, pp. 6-11 (6 pages)
Information Studies: Theory & Application
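Funding
National "863" Program project "Development of a Search Engine Oriented Toward Scientific and Technical Literature Services" (Grant No. 2011AA01A206)
One of the outcomes of the 2011 Nanjing University Graduate Research and Innovation Fund project "Chinese-English Bilingual Text Clustering Technology and Its Applications" (Grant No. 2011CW12)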
Keywords
Zipf's law
maximum likelihood estimation
word frequency distribution