Abstract
Cluster analysis is one of the most important techniques in data mining and is widely applied across social and economic domains. K-means is one of the most classic and widely used clustering methods, but it depends heavily on the initial conditions and the number of clusters is difficult to determine in advance, which limits its range of application. This paper introduces definitions and measures for the cohesion and coupling of clusters. Based on the principle of "high cohesion and low coupling", the clusters obtained during bisecting K-means clustering are repeatedly split and merged, and the clustering result is checked against the requirements to determine the number of clustering rounds and the number of clusters, thereby improving the bisecting K-means procedure. Experiments on the Iris data set show that the algorithm is more stable and also achieves relatively high clustering accuracy.
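The split-and-merge idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the paper's actual cohesion and coupling measures, so this sketch assumes stand-ins: cohesion is taken as the mean distance of a cluster's points to its centroid (smaller means tighter), and coupling as the distance between two centroids (smaller means more strongly coupled). The function names and the thresholds `max_spread` and `min_gap` are hypothetical, and the sketch runs one split phase followed by one merge phase rather than interleaving them as the paper may do.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def cohesion(points):
    """Assumed measure: mean distance to the centroid (smaller = tighter)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def coupling(a, b):
    """Assumed measure: centroid distance (smaller = more strongly coupled)."""
    return math.dist(centroid(a), centroid(b))

def bisect(points, rng, iters=20):
    """Split one cluster into two halves with plain 2-means."""
    # Start from two distinct points so that both halves begin non-empty.
    centers = rng.sample(sorted(set(points)), 2)
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            nearer_second = math.dist(p, centers[0]) > math.dist(p, centers[1])
            halves[1 if nearer_second else 0].append(p)
        if not halves[0] or not halves[1]:
            break
        centers = [centroid(h) for h in halves]
    return halves

def split_merge_cluster(points, max_spread, min_gap, seed=0):
    """Bisect loose clusters, then merge clusters that sit too close together."""
    rng = random.Random(seed)
    clusters = [list(points)]
    # Split phase: bisect the least cohesive cluster until all are tight enough.
    while True:
        worst = max(clusters, key=cohesion)
        if cohesion(worst) <= max_spread or len(set(worst)) < 2:
            break
        left, right = bisect(worst, rng)
        if not left or not right:
            break
        clusters.remove(worst)
        clusters += [left, right]
    # Merge phase: join the closest pair while its centroid gap is below min_gap.
    while len(clusters) > 1:
        gap, i, j = min(
            (coupling(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if gap >= min_gap:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy data: two small, well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
clusters = split_merge_cluster(data, max_spread=0.5, min_gap=1.0)
print(len(clusters), [len(c) for c in clusters])
```

In this sketch the number of clusters is not supplied by the caller; it falls out of the two thresholds, which mirrors the abstract's aim of letting the cohesion and coupling checks decide when to stop.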
Authors
郁湧
康庆怡
陈长赓
阚世林
骆永军
YU Yong1,2, KANG Qing-yi1, CHEN Chang-geng1, KAN Shi-lin1, LUO Yong-jun (1 School of Software, Yunnan University, Kunming 650504, China; 2 Key Laboratory for Software Engineering of Yunnan Province, Kunming 650504, China)
Source
《计算机科学》
CSCD
Peking University Core Journals
2018, Issue B06, pp. 460-464 (5 pages)
Computer Science
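Funding
National Natural Science Foundation of China (61462091)
Yunnan University Data-Driven Software Engineering Provincial Science and Technology Innovation Team Project (2017HC012)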
Keywords
Clustering
Bisecting K-means
Cohesion
Coupling