A CLOPE parallel clustering algorithm based on MapReduce
Abstract: The CLOPE clustering algorithm for transactional data performs well in running speed, memory overhead, and clustering quality, but as data volume grows rapidly its running time increases sharply, to the point where the algorithm becomes unusable. To address this, this paper improves CLOPE using the YARN resource management system of the Hadoop framework and proposes a CLOPE parallel clustering algorithm based on the MapReduce architecture (MapReduce-CLOPE). The algorithm consists of two phases. In the first phase, Map operations are executed: Hadoop splits the dataset into blocks and runs CLOPE on each block in parallel to form small clusters. In the second phase, Reduce operations merge the small clusters into large clusters over multiple iterations. Experiments show that, when analyzing 1,000 Amazon data records with 20,000 attributes, MapReduce-CLOPE takes a steady 22 s, whereas CLOPE takes 50 to 60 s. As the data volume increases, CLOPE can no longer complete the computation, while the running time of MapReduce-CLOPE remains largely stable. MapReduce-CLOPE is therefore significantly faster than CLOPE, its running time is less sensitive to data volume, and its clustering quality is close to that of CLOPE.
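The two-phase structure described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Hadoop: the names `Cluster`, `absorb`, `map_phase`, and `reduce_phase` are hypothetical, the Map side performs only a single greedy CLOPE-style pass (without CLOPE's later refinement passes), and the merge rule in the Reduce side is an assumption, since the abstract does not specify how small clusters are combined. The only part taken from the CLOPE literature is the per-cluster profit term S * N / W^r, where S is the total number of item occurrences, N the number of transactions, W the number of distinct items, and r the repulsion parameter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """CLOPE cluster summary: item histogram, size S, transaction count N."""
    occ: Counter = field(default_factory=Counter)  # item -> occurrence count
    size: int = 0                                   # S: total item occurrences
    n: int = 0                                      # N: number of transactions

    @property
    def width(self):
        return len(self.occ)  # W: number of distinct items

    def gain(self, r):
        # This cluster's term in the CLOPE profit numerator: S * N / W^r.
        return self.size * self.n / self.width ** r if self.width else 0.0

    def merged(self, other):
        return Cluster(self.occ + other.occ, self.size + other.size, self.n + other.n)


def as_cluster(transaction):
    """Wrap one transaction (an iterable of items) as a one-transaction cluster."""
    items = set(transaction)
    return Cluster(Counter(items), len(items), 1)


def absorb(clusters, piece, r):
    """Merge `piece` into the cluster whose gain rises most, else keep it separate."""
    best_idx, best_delta = None, piece.gain(r)  # baseline: piece stands alone
    for idx, c in enumerate(clusters):
        delta = c.merged(piece).gain(r) - c.gain(r)
        if delta > best_delta:
            best_idx, best_delta = idx, delta
    if best_idx is None:
        clusters.append(piece)
    else:
        clusters[best_idx] = clusters[best_idx].merged(piece)


def map_phase(split, r):
    """One greedy CLOPE-style pass over a data split; returns its small clusters."""
    clusters = []
    for transaction in split:
        absorb(clusters, as_cluster(transaction), r)
    return clusters


def reduce_phase(small_clusters, r, max_iter=10):
    """Assumed merge rule: re-absorb clusters until the cluster count stops shrinking."""
    current = list(small_clusters)
    for _ in range(max_iter):
        merged = []
        for c in current:
            absorb(merged, c, r)
        if len(merged) == len(current):
            break
        current = merged
    return current
```

In this sketch each split would be passed to `map_phase` independently (the step Hadoop would run in parallel), and the concatenated results would be handed to `reduce_phase` with the same repulsion value `r`. Reusing one `absorb` rule for both transactions and whole clusters is a simplification chosen to keep the example short.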
Source: Journal of Guangxi University (Natural Science Edition), indexed in CAS and the PKU Core Journals list, 2016, No. 5, pp. 1567-1575 (9 pages)
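Funding: National Natural Science Foundation of China (71301101); Ministry of Transport Construction Science and Technology Project (2015328810160); Key Project of the Science and Technology Commission of Shanghai Municipality (14DZ2280200)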
Keywords: data mining; CLOPE; MapReduce; clustering algorithm; Hadoop
