Funding: National Natural Science Foundation of China (Nos. 60572065, 60772080, 60532020)
Abstract: The Gap statistic is a well-known cluster validity index, but its implementation is hard to understand and its result is hard to determine accurately. A direct method is presented to improve the performance of the Gap statistic: it uses the second-order difference of the within-cluster dispersion in place of the constructed null reference distribution. The Gap statistic is thereby reformulated and becomes easy to implement, and its uncertainty in applications is reduced. The limitations of the Gap statistic are also analyzed through two typical examples, which show that it is difficult to apply to datasets containing strongly overlapping clusters or clusters of uneven density. Experiments verify the usefulness of the proposed method.
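The idea of using a second-order difference of the within-cluster dispersion can be illustrated with a minimal sketch. The abstract does not state the exact formula or selection rule, so the following assumes the plain discrete second difference D(k) = W(k-1) - 2W(k) + W(k+1) and picks the k that maximizes it; the paper may instead work with log W(k) or a differently normalized form. The function names and the dispersion values are hypothetical and are not taken from the paper.

```python
def second_order_difference(dispersion):
    """Return {k: W(k-1) - 2*W(k) + W(k+1)} for every k with both neighbours.

    `dispersion` maps a cluster count k to its within-cluster dispersion W(k).
    This form of the difference is an assumption, not the paper's definition.
    """
    diffs = {}
    for k in sorted(dispersion):
        if k - 1 in dispersion and k + 1 in dispersion:
            diffs[k] = dispersion[k - 1] - 2 * dispersion[k] + dispersion[k + 1]
    return diffs


def estimate_num_clusters(dispersion):
    """Pick the k with the largest second-order difference (assumed rule)."""
    diffs = second_order_difference(dispersion)
    if not diffs:
        raise ValueError("need W(k) for at least three consecutive k")
    return max(diffs, key=diffs.get)


# Hypothetical dispersion values for k = 1..6 (illustrative only).
example_w = {1: 1000.0, 2: 520.0, 3: 150.0, 4: 130.0, 5: 115.0, 6: 105.0}
best_k = estimate_num_clusters(example_w)
print(best_k)
```

In this sketch W(k) drops sharply up to k = 3 and flattens afterwards, so the second difference peaks at the "elbow". No reference datasets are sampled, which is the simplification the abstract attributes to the proposed method. In practice W(k) would come from a clustering run for each k, for example the pooled within-cluster sum of squares.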