Abstract
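The CLOPE clustering algorithm for transactional data performs well in running speed, memory overhead, and clustering quality, but as data volumes grow rapidly its running time increases sharply, to the point where it becomes unusable. To address this, the CLOPE algorithm is adapted to run on the YARN resource management system of the Hadoop framework, yielding a parallel CLOPE clustering algorithm based on the MapReduce architecture. The algorithm has two phases. In the first phase, Map operations are executed: Hadoop splits the dataset and runs CLOPE on the splits in parallel to produce small clusters. In the second phase, Reduce operations are executed, merging the small clusters into large clusters over multiple iterations. Experimental results show that, when analyzing 1 000 Amazon data records with 20 000 attributes, MapReduce-CLOPE takes a stable 22 s, while CLOPE takes 50-60 s. As the data volume grows, CLOPE can no longer complete the computation, whereas the running time of MapReduce-CLOPE remains largely stable. MapReduce-CLOPE is therefore significantly better than CLOPE in computation time, its computation time is less sensitive to data volume, and its clustering quality is close to that of CLOPE.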
A parallel CLOPE algorithm based on MapReduce (MapReduce-CLOPE) is presented in this paper. The algorithm consists of two phases. In the first phase, Map operations split the large dataset on Hadoop into multiple small data blocks, and the CLOPE algorithm is executed on each data block in parallel to form small clusters. In the second phase, Reduce operations merge the small clusters into larger clusters over multiple iterations. Experiments show that, when analyzing 1 000 Amazon data records with 20 000 attributes, the MapReduce-CLOPE algorithm takes a steady 22 seconds, while the CLOPE algorithm takes between 50 and 60 seconds. As the data volume increases, the CLOPE algorithm cannot finish the calculation, whereas the MapReduce-CLOPE algorithm completes it in a stable time. The MapReduce-CLOPE algorithm is therefore significantly superior to the CLOPE algorithm in running time and is less affected by data volume, while its clustering quality is close to that of the CLOPE algorithm.
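The two-phase structure described in the abstract can be illustrated with a small sequential sketch. It is not the paper's implementation: it simulates the Map and Reduce phases in plain Python rather than on Hadoop, it uses only CLOPE's first allocation pass (the standard criterion S·N/W^r with repulsion parameter r), and the merge rule used in the Reduce step is an assumption, since the abstract does not specify one. The names `Cluster`, `clope_pass`, `map_phase`, and `reduce_phase` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Summary of a cluster: transaction count and item occurrence histogram."""
    n: int = 0
    occ: Counter = field(default_factory=Counter)

    def score(self, r: float) -> float:
        # CLOPE cluster term: S * N / W^r, where S is the total number of
        # item occurrences and W is the number of distinct items.
        return sum(self.occ.values()) * self.n / len(self.occ) ** r

    def merged(self, other: "Cluster") -> "Cluster":
        return Cluster(self.n + other.n, self.occ + other.occ)


def clope_pass(units, r):
    """One greedy allocation pass: put each unit where the score gain is largest.

    A unit is either a single transaction or a small cluster, both
    represented as Cluster objects. Input units are not modified.
    """
    clusters = []
    for u in units:
        best, best_gain = None, u.score(r)  # gain of opening a new cluster
        for c in clusters:
            gain = c.merged(u).score(r) - c.score(r)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            clusters.append(Cluster(u.n, Counter(u.occ)))
        else:
            best.n += u.n
            best.occ.update(u.occ)
    return clusters


def map_phase(transactions, n_splits, r):
    """Cluster each split independently (on Hadoop these would run in parallel)."""
    small = []
    for i in range(n_splits):
        split = transactions[i::n_splits]
        units = [Cluster(1, Counter(set(t))) for t in split]
        small.extend(clope_pass(units, r))
    return small


def reduce_phase(small, r):
    """Assumed merge rule: repeat the greedy pass over clusters until none merge."""
    clusters = small
    while True:
        merged = clope_pass(clusters, r)
        if len(merged) == len(clusters):
            return merged
        clusters = merged


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["x", "y"],
            ["x", "y", "z"], ["a", "c"], ["y", "z"]]
    for c in reduce_phase(map_phase(data, n_splits=2, r=2.0), r=2.0):
        print(c.n, sorted(c.occ))
```

Treating a small cluster as a "unit" lets the same gain criterion serve both phases, which keeps the sketch short; a real MapReduce job would additionally need to serialize the cluster summaries between mappers and reducers.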
Source
《广西大学学报(自然科学版)》
CAS
Peking University Core Journal
2016, No. 5, pp. 1567-1575 (9 pages)
Journal of Guangxi University (Natural Science Edition)
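Funding
National Natural Science Foundation of China (71301101)
Ministry of Transport Construction Science and Technology Project (2015328810160)
Key Project of the Science and Technology Commission of Shanghai Municipality (14DZ2280200)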