Abstract
Cluster analysis is an important research direction in data mining. Many clustering algorithms have been designed for large databases, and CURE, a classical hierarchical method, is a typical example. This paper improves CURE in three respects. First, the new method still represents a cluster by several points but drops the step that shrinks the representative points. Second, by analyzing the statistical characteristics of the nearest-neighbor distance (1-DIST) within a cluster, we propose a criterion that separates sub-clusters automatically, so the number of clusters need not be given in advance. Third, on top of CURE's random sampling and partitioned clustering of the original data, we add a grid-partitioning step, which both reduces the impact of noise and shortens clustering time. Tests on two-dimensional data show that the improved CURE correctly identifies most clusters, runs faster than the original algorithm, and is better at discovering clusters of arbitrary shape.
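The abstract does not give the details of the grid step or of the 1-DIST criterion, so the following is only a minimal sketch of what such building blocks might look like for 2-D data. The function names, the `cell_size` and `min_count` parameters, and the rule "discard cells with too few points" are illustrative assumptions, not the paper's actual procedure.

```python
from collections import defaultdict
from math import floor, hypot


def grid_filter(points, cell_size, min_count):
    """Assumed noise filter: keep points lying in grid cells
    that contain at least `min_count` points."""
    cells = defaultdict(list)
    for x, y in points:
        # Map each point to the integer index of its square cell.
        cells[(floor(x / cell_size), floor(y / cell_size))].append((x, y))
    kept = []
    for members in cells.values():
        if len(members) >= min_count:
            kept.extend(members)
    return kept


def one_dist(points):
    """Nearest-neighbor distance (1-DIST) of every point, by brute force.
    Requires at least two points."""
    result = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        result.append(nearest)
    return result


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.5, 5.5)]
    dense = grid_filter(data, cell_size=1.0, min_count=2)
    print(sorted(dense))    # the isolated point at (5.5, 5.5) is dropped
    print(one_dist(dense))  # per-point nearest-neighbor distances
```

Statistics of the `one_dist` values (for example their mean and spread) are the kind of quantity a separation criterion could be built on; the specific criterion used in the paper is not stated in the abstract and is not reproduced here.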
Source
《内蒙古石油化工》
CAS
2005, No. 8, pp. 12-15 (4 pages)
Inner Mongolia Petrochemical Industry