A CLOPE parallel clustering algorithm based on MapReduce

Abstract: CLOPE, a clustering algorithm for transactional data, performs well in running speed, memory overhead, and clustering quality, but as data volume grows rapidly its running time increases sharply, to the point where it becomes unusable. To address this, the paper improves CLOPE using the YARN resource management system of the Hadoop framework and proposes a parallel CLOPE clustering algorithm based on the MapReduce architecture (MapReduce-CLOPE). The algorithm consists of two phases. In the first phase, Map operations split the dataset on Hadoop into multiple data blocks and run CLOPE on each block in parallel to form small clusters. In the second phase, Reduce operations merge the small clusters into large clusters over multiple iterations. Experiments on 1,000 Amazon data records with 20,000 attributes show that MapReduce-CLOPE runs in a stable 22 s, whereas CLOPE takes 50-60 s. As the data volume increases, CLOPE can no longer complete the computation, while the running time of MapReduce-CLOPE remains essentially stable. MapReduce-CLOPE is therefore significantly faster than CLOPE, its running time is less sensitive to data volume, and its clustering quality is close to that of CLOPE.
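The two phases described in the abstract can be sketched in plain Python. This is a minimal single-process illustration, not the paper's Hadoop implementation: the per-cluster statistics and the profit criterion follow the standard CLOPE formulation, while the names `Cluster`, `clope_map`, `clope_reduce`, and `mapreduce_clope`, the single allocation pass in the map step, and the greedy pairwise merge rule in the reduce step are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Summary statistics CLOPE keeps for one cluster."""
    n: int = 0                                      # number of transactions
    size: int = 0                                   # S: total item occurrences
    occ: Counter = field(default_factory=Counter)   # item -> occurrence count

    @property
    def width(self):                                # W: number of distinct items
        return len(self.occ)

    def add(self, items):
        self.n += 1
        self.size += len(items)
        self.occ.update(items)

    def merge(self, other):
        self.n += other.n
        self.size += other.size
        self.occ.update(other.occ)


def value(size, n, width, r):
    """One cluster's term in the CLOPE profit numerator: S * n / W**r."""
    return size * n / width ** r if n else 0.0


def profit(clusters, r=2.0):
    """CLOPE profit: sum of S * n / W**r over clusters, divided by total n."""
    return sum(value(c.size, c.n, c.width, r) for c in clusters) / sum(c.n for c in clusters)


def add_gain(c, items, r):
    """Change in the profit numerator if `items` joined cluster `c`."""
    width = len(c.occ.keys() | items)
    return value(c.size + len(items), c.n + 1, width, r) - value(c.size, c.n, c.width, r)


def merge_gain(a, b, r):
    """Change in the profit numerator if clusters `a` and `b` were merged."""
    width = len(a.occ.keys() | b.occ.keys())
    return (value(a.size + b.size, a.n + b.n, width, r)
            - value(a.size, a.n, a.width, r) - value(b.size, b.n, b.width, r))


def clope_map(transactions, r=2.0):
    """Map-side step (assumed): one CLOPE allocation pass over a data split."""
    clusters = []
    for t in transactions:
        items = set(t)
        if not items:                               # skip empty transactions
            continue
        best, best_gain = None, add_gain(Cluster(), items, r)  # gain of a new cluster
        for c in clusters:
            gain = add_gain(c, items, r)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            best = Cluster()
            clusters.append(best)
        best.add(items)
    return clusters


def clope_reduce(clusters, r=2.0):
    """Reduce-side step (assumed rule): merge the best pair until no merge helps."""
    clusters = list(clusters)
    while True:
        gain, i, j = max(((merge_gain(a, b, r), i, j)
                          for i, a in enumerate(clusters)
                          for j, b in enumerate(clusters) if i < j),
                         default=(0.0, -1, -1))
        if gain <= 0:
            return clusters
        clusters[i].merge(clusters.pop(j))          # j > i, so index i stays valid


def mapreduce_clope(transactions, n_splits=2, r=2.0):
    """Run the map step on each split (sequentially here), then the reduce step."""
    splits = [transactions[i::n_splits] for i in range(n_splits)]
    small = [c for split in splits for c in clope_map(split, r)]
    return clope_reduce(small, r)


data = [["a", "b"], ["a", "b"], ["x", "y"], ["x", "y"]] * 2
for c in mapreduce_clope(data, n_splits=2, r=2.0):
    print(sorted(c.occ), c.n)
```

On this toy input each split yields one small cluster per item set, and the reduce step merges the matching clusters across splits, leaving two clusters. In a real Hadoop job the map step would run in parallel on separate nodes and only the cluster summaries would be sent to the reducers; here the splits are processed in a loop purely to show the data flow.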

Authors: Wang Yuping (王玉平), Hao Yangyang (郝杨杨), Huang Youfang (黄有方)

Affiliations: Informatization Office, Shanghai Maritime University; Logistics Research Center, Shanghai Maritime University

Source: Journal of Guangxi University (Natural Science Edition) (《广西大学学报（自然科学版）》), indexed in CAS and the Peking University Core Journals list, 2016, No. 5, pp. 1567-1575 (9 pages)

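Funding: National Natural Science Foundation of China (71301101); Construction Science and Technology Project of the Ministry of Transport (2015328810160); Key Project of the Shanghai Municipal Science and Technology Commission (14DZ2280200)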

Keywords: data mining; CLOPE; MapReduce; clustering algorithm; Hadoop

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
