Abstract
The k-means clustering algorithm requires the number of clusters and the initial cluster centers to be fixed before clustering. The number of clusters is difficult to specify accurately, and the initial centers are usually chosen by randomly selecting k samples. Because different initial centers may lead to different clustering results, random selection is largely blind and makes the clustering results very unstable. To address this, a partition-based method for determining the number of clusters and the initial centers is proposed. The method partitions the data space into grid cells, takes the number of data points in each cell as that cell's data density, and counts the cells whose density is a local maximum. The data set is partitioned at different division values; when the number of local-maximum-density cells becomes relatively stable, that number is taken as the number of clusters, and the initial cluster centers are obtained at the same time. Simulation experiments were conducted on UCI machine learning data sets and on randomly generated artificial data sets. The experimental results show that the proposed algorithm is effective and feasible and achieves high accuracy.
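The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the neighbourhood definition, the stability criterion, or how centers are read off the grid, so the choices below are assumptions. They are equal-width bins per axis, a strict maximum over all adjacent cells, "stable" meaning the same count for `stable_runs` consecutive resolutions, and centers taken as the mean of the points in each maximal cell. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def grid_density(points, divisions):
    """Group points by grid cell after splitting each axis into `divisions` equal bins."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    cells = defaultdict(list)
    for p in points:
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # Degenerate axis goes to bin 0; the axis maximum is clamped into the last bin.
            i = 0 if span == 0 else min(int((p[d] - lows[d]) / span * divisions), divisions - 1)
            idx.append(i)
        cells[tuple(idx)].append(p)
    return cells


def local_density_maxima(cells):
    """Return cells whose point count strictly exceeds that of every adjacent cell.

    Assumption: adjacency includes diagonal neighbours; empty cells count as 0.
    """
    dims = len(next(iter(cells)))
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    maxima = []
    for idx, members in cells.items():
        count = len(members)
        neighbours = (tuple(i + o for i, o in zip(idx, off)) for off in offsets)
        if all(count > len(cells.get(n, ())) for n in neighbours):
            maxima.append(idx)
    return maxima


def estimate_k_and_centers(points, division_range=range(2, 11), stable_runs=3):
    """Scan grid resolutions and stop once the maxima count repeats `stable_runs` times.

    Returns (k, centers), or (None, []) if the count never stabilises in the range.
    """
    history = []
    for g in division_range:
        cells = grid_density(points, g)
        maxima = local_density_maxima(cells)
        history.append(len(maxima))
        if len(history) >= stable_runs and len(set(history[-stable_runs:])) == 1:
            # Assumed center rule: mean of the points inside each maximal cell.
            centers = [
                tuple(sum(c) / len(cells[m]) for c in zip(*cells[m]))
                for m in maxima
            ]
            return len(maxima), centers
    return None, []
```

The returned centers could then be passed to any k-means implementation as its initial centers in place of a random draw. The strict-maximum rule means that two adjacent cells with equal counts are both rejected, and an isolated single point still counts as a maximum, so a real implementation would likely need a tie-breaking rule and a minimum-density threshold.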
Source
Computer Technology and Development (《计算机技术与发展》)
2017, Issue 7, pp. 76-78, 82 (4 pages in total)
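Funding
National Natural Science Foundation of China (61471203, 61101105)
Doctoral Program Foundation of the Ministry of Education (20113223120001)
Jiangsu 973 Program (BK2011027)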
Keywords
k-means clustering
number of clusters
initial cluster centers
partitioning