Abstract
Objective To address the sequence distribution problem in DNA word-frequency analysis by exploring the feasibility of quantifying differences among the distributions of different sequences within a genome. Methods Numerical simulation was used to compare two indices: the statistic of the Kolmogorov-Smirnov test and the centroid of the region under the cumulative probability curve. Results As the sample size increased, the dispersion of both indices gradually decreased, whereas their central tendency was not noticeably affected, and different distributions concentrated at different locations. At a sample size of 100, the smallest discriminable difference was about 0.1 for the statistic and about 0.02 for the centroid. With the statistic, two reference distributions were needed to separate the five test distributions, whereas the centroid separated the five test distributions directly. Conclusion Both indices can serve as quantitative measures of differences between distributions, but in most cases the sample size should be greater than 100. When different distributions need to be represented in the same coordinate system, the centroid may be the better choice.
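The two indices compared in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. The abstract does not say how the centroid is computed, so this version assumes it is the geometric centroid of the region between the empirical cumulative curve and the horizontal axis over a fixed window [lo, hi], approximated with the trapezoid rule. The reference distribution (standard normal), the window, the shift, and the function names `normal_cdf`, `ks_statistic`, `ecdf_centroid`, and `simulate` are all illustrative assumptions.

```python
import bisect
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed reference distribution)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, ref_cdf):
    """One-sample Kolmogorov-Smirnov statistic D = sup |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = ref_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ecdf_centroid(sample, lo, hi, grid=2000):
    """Centroid (cx, cy) of the region under the empirical CDF on [lo, hi].

    Assumption: the region is bounded below by the horizontal axis and the
    integrals are approximated with the trapezoid rule on a uniform grid.
    """
    xs = sorted(sample)
    n = len(xs)
    h = (hi - lo) / grid
    ts = [lo + k * h for k in range(grid + 1)]
    fs = [bisect.bisect_right(xs, t) / n for t in ts]
    area = mx = my = 0.0
    for k in range(grid):
        area += 0.5 * (fs[k] + fs[k + 1]) * h
        mx += 0.5 * (ts[k] * fs[k] + ts[k + 1] * fs[k + 1]) * h
        my += 0.25 * (fs[k] ** 2 + fs[k + 1] ** 2) * h
    if area == 0.0:
        raise ValueError("no sample mass inside the window [lo, hi]")
    return mx / area, my / area


def simulate(n, reps=200, shift=0.5, seed=0):
    """Draw `reps` samples of size n from N(shift, 1); return (D, cx, cy) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(reps):
        s = [rng.gauss(shift, 1.0) for _ in range(n)]
        cx, cy = ecdf_centroid(s, -5.0, 5.0)
        rows.append((ks_statistic(s, normal_cdf), cx, cy))
    return rows


if __name__ == "__main__":
    for n in (100, 1000):
        rows = simulate(n)
        for name, col in zip(("D", "cx", "cy"), zip(*rows)):
            mean = sum(col) / len(col)
            sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
            print(f"n={n:5d} {name}: mean={mean:.4f} sd={sd:.4f}")
```

Running the script at two sample sizes prints the mean and standard deviation of each index, which is one way to examine how their spread and central tendency respond to sample size under these assumptions.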
Source
《中国卫生统计》
CSCD
Peking University Core Journals (北大核心)
2014, No. 4, pp. 554-558 (5 pages)
Chinese Journal of Health Statistics
Funding
National Natural Science Foundation of China (Grant No. 31071156)