Comparison Research on Mahout Clustering Algorithms under Hadoop Platform

Cited by: 11

Abstract: Clustering is an important data mining technique that partitions a collection of physical or abstract objects into classes of similar objects. How to apply traditional clustering algorithms to large-scale data is a hot research topic in the big data field. This paper compares the principles and MapReduce implementations of three clustering algorithms (Canopy, standard K-means, and fuzzy K-means) in Mahout, the open-source machine learning library for the Hadoop cloud computing platform. The three algorithms were run on clusters with different numbers of nodes and on data sets of different sizes, and were compared in terms of speedup, scalability, and scale growth. The experimental results show that, in a parallel environment: Canopy runs fastest, K-means second, and fuzzy K-means slowest; all three algorithms achieve good speedup, with Canopy the best, and the speedup of fuzzy K-means rises sharply once the data volume and the number of nodes reach a certain scale; all three algorithms show good scalability and scale growth, both of which improve as the data size increases, with Canopy having the best scalability and fuzzy K-means showing the largest improvement in scalability and scale growth.
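
This record does not reproduce the paper's formulas for its three evaluation metrics. As a rough illustration only, the sketch below uses definitions commonly seen in the parallel clustering literature: speedup (same data, more nodes), sizeup (same nodes, more data), and scaleup (nodes and data grown together). Mapping the abstract's "scale growth" to sizeup and "scalability" to scaleup is an assumption, the function names are hypothetical, and the timings are made-up numbers, not results from the paper.

```python
def speedup(t_one_node: float, t_p_nodes: float) -> float:
    """Run time on one node divided by run time on p nodes, same data set."""
    return t_one_node / t_p_nodes


def sizeup(t_base_data: float, t_scaled_data: float) -> float:
    """Run time on m-times-larger data divided by run time on the base data,
    with the number of nodes held fixed."""
    return t_scaled_data / t_base_data


def scaleup(t_base: float, t_scaled_both: float) -> float:
    """Run time for the base data on one node divided by run time for
    m-times-larger data on m nodes. Values near 1.0 mean the system
    absorbs the extra data when nodes are added in proportion."""
    return t_base / t_scaled_both


# Hypothetical timings in seconds (illustrative only, not measured values).
example_speedup = speedup(t_one_node=120.0, t_p_nodes=40.0)
example_sizeup = sizeup(t_base_data=40.0, t_scaled_data=100.0)
example_scaleup = scaleup(t_base=120.0, t_scaled_both=150.0)

print("speedup:", example_speedup)
print("sizeup:", example_sizeup)
print("scaleup:", example_scaleup)
```

Under these definitions, a speedup close to the node count, a sizeup below the data growth factor, and a scaleup close to 1.0 would all be read as favorable.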

Authors: Niu Yihan (牛怡晗), Hai Mo (海沫)

Affiliation: School of Information, Central University of Finance and Economics

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, Issue S1, pp. 465-469 (5 pages)

Funding: Supported by the Beijing Higher Education Young Elite Teacher Project (YETP0988)

Keywords: clustering; Hadoop; Mahout; K-means; fuzzy K-means; Canopy

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
