Abstract
With the continuing progress of information technology, data volumes keep growing. Clustering is a typical data analysis method, and clustering of large-scale data has attracted particular attention in recent years. Most existing sequential clustering algorithms incur high memory and computation-time costs when applied to large-scale data. To address this problem, this paper proposes an improved artificial bee colony clustering algorithm based on MapReduce. By introducing the MapReduce parallel programming paradigm, the algorithm computes the fitness of cluster centers quickly and can cluster large-scale data efficiently. The clustering quality, scalability, and efficiency of the algorithm are verified on two kinds of data: a synthetic dataset and a real dataset obtained from a disk drive manufacturing process. Experimental results show that, compared with the existing PK-Means algorithm and the parallel K-PSO algorithm, the proposed algorithm achieves better clustering quality, stronger scalability, and higher clustering efficiency.
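The idea of computing cluster-center fitness with MapReduce can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not state the fitness function, so the sketch assumes fitness is the total distance from each point to its nearest center (lower is better), and the names `map_fitness`, `reduce_fitness`, and `fitness` are hypothetical. Each data split plays the role of one mapper's input, and the reduce step sums the partial results.

```python
from functools import reduce
from math import dist


def map_fitness(chunk, centers):
    """Map step (assumed form): partial fitness of one data split.

    Sums, over the points in the split, the Euclidean distance
    to the nearest candidate cluster center.
    """
    return sum(min(dist(p, c) for c in centers) for p in chunk)


def reduce_fitness(partials):
    """Reduce step: add up the partial sums emitted by the mappers."""
    return reduce(lambda a, b: a + b, partials, 0.0)


def fitness(splits, centers):
    """Fitness of one candidate solution (one food source in ABC terms)."""
    return reduce_fitness(map_fitness(chunk, centers) for chunk in splits)


# Toy data: two splits, as if the dataset were stored in two blocks.
splits = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
good_centers = [(0.0, 0.0), (10.0, 10.0)]
bad_centers = [(5.0, 5.0), (6.0, 6.0)]

print(fitness(splits, good_centers))
print(fitness(splits, bad_centers))
```

Because the map step only needs the current centers and its own split, the splits can be evaluated independently, which is what makes the fitness evaluation a natural fit for MapReduce. In an actual cluster deployment the two functions would be mapper and reducer tasks rather than local Python calls.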
Authors
李果
袁小凯
许爱东
张乾坤
张福铮
LI Guo; YUAN Xiaokai; XU Aidong; ZHANG Qiankun; ZHANG Fuzheng (Southern Power Grid Institute of Science, Guangzhou 510080)
Source
《计算机与数字工程》
2020, No. 1, pp. 124-129, 146 (7 pages in total)
Computer & Digital Engineering
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61672393)
Keywords
big data
large-scale datasets
MapReduce
artificial bee colony
clustering
parallel programming paradigm