Abstract
With the rapid development of the Internet, data volumes are growing exponentially, which poses significant challenges to traditional clustering methods. Improving clustering algorithms so that useful information can still be extracted has therefore become a research hotspot. K-Medoids is a common partition-based clustering algorithm. Its strength is that it handles isolated points and noise effectively, but it suffers from sensitivity to initial centers, a tendency to fall into local optima, and CPU and memory bottlenecks on large data sets. To address these problems, we propose a genetic-algorithm-based K-Medoids clustering method under the MapReduce framework. The population-based evolution of the genetic algorithm is used to reduce the sensitivity of K-Medoids to its initial centers, and the resulting genetic K-Medoids algorithm is then parallelized with MapReduce to improve efficiency. Experiments on labeled data sets show that the proposed algorithm, running on a Hadoop cluster, effectively improves both the quality and the efficiency of clustering.
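The abstract does not specify the chromosome encoding, genetic operators, or fitness function, so the following is only a minimal single-process sketch of the general idea: a genetic algorithm searching over sets of medoid indices. The names `dist`, `cost`, and `ga_kmedoids`, the truncation selection, the union-based crossover, and all parameter defaults are assumptions made for illustration and are not taken from the paper. The MapReduce parallelization is not shown.

```python
import random

def dist(a, b):
    # Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cost(points, medoids):
    # Sum of distances from every point to its nearest medoid (lower is better)
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def ga_kmedoids(points, k, pop_size=20, generations=50, mutation_rate=0.2, seed=0):
    # Hypothetical GA: an individual is a sorted tuple of k distinct point indices
    rng = random.Random(seed)
    n = len(points)
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: cost(points, ind))
        parents = ranked[: pop_size // 2]   # truncation selection (assumed)
        children = [ranked[0]]              # elitism: always keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))  # crossover: pick k medoids from the union
            if rng.random() < mutation_rate:
                # mutation: swap one medoid for a point that is not currently a medoid
                child.discard(rng.choice(sorted(child)))
                candidates = [i for i in range(n) if i not in child]
                child.add(rng.choice(candidates))
            children.append(tuple(sorted(child)))
        population = children
    best = min(population, key=lambda ind: cost(points, ind))
    return best, cost(points, best)

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    medoids, total = ga_kmedoids(data, k=2)
    print(medoids, total)
```

Because many candidate medoid sets are evaluated in each generation, the search depends less on any single initial choice of centers. In a distributed setting, the per-point nearest-medoid distances inside `cost` are the natural part to compute in parallel and then sum, though how the paper actually divides the work between map and reduce tasks is not stated in the abstract.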
Authors
LAI Xiang-yang (赖向阳)
GONG Xiu-jun (宫秀军)
HAN Lai-ming (韩来明)
College of Computer Science and Technology, Tianjin University, Tianjin 300072, China
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300072, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue 3, pp. 23-26, 58 (5 pages)
Funding
National Natural Science Foundation of China (61170177)
National Key Basic Research Program of China (2013CB32930X)