An accelerated K-means clustering algorithm for large datasets with many cluster centers (cited by: 28)

Accelerate K-means for multi-center clustering of big datasets


Abstract: As data volume and dimensionality grow exponentially and the number of cluster centers in practical applications increases, the traditional K-means clustering algorithm can no longer meet the time and memory requirements of real applications. To address this problem, an accelerated K-means clustering algorithm is proposed, based on dynamic adjustment of the cluster centers and Elkan's triangle-inequality test. Experimental results show that when the data size reaches 100,000 records and the number of clusters exceeds 20, the algorithm converges faster and uses less memory than Elkan's algorithm.

The K-means algorithm is the most popular clustering algorithm, but for big datasets with many clusters it takes a long time to find all the clusters. This paper proposes a new acceleration method based on the idea of dynamic and immediate adjustment of the centers in K-means, combined with the triangle inequality. The triangle inequality is used to avoid redundant distance computations. Unlike Elkan's algorithm, however, the centers are first divided into outer centers and inner centers for each data point, and lower bounds are tracked only for the inner centers. In addition, the data points are adjusted cluster by cluster and each cluster center is updated immediately after that cluster's adjustment finishes, which effectively reduces the number of iterations. The experimental results show that this algorithm runs much faster than Elkan's algorithm, with much lower memory consumption, when the number of cluster centers is larger than 20 and the number of records in the dataset is greater than 10 million, and that the speedup improves as k increases.
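The triangle-inequality test that the abstract refers to can be illustrated with a short sketch. This is not the paper's DIACK algorithm: it leaves out the inner/outer center split, the lower-bound bookkeeping, and the immediate center updates, and shows only the basic pruning rule from Elkan's method, namely that if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c). The function name `assign_with_pruning` and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.

    Illustrative sketch only; not the algorithm from the paper.
    """
    k = centers.shape[0]
    # Center-to-center distances, computed once for the whole pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(X.shape[0], dtype=int)
    computed = 0  # point-to-center distances actually evaluated
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        computed += 1
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best), so c_j cannot be strictly closer.
            if cc[best, j] >= 2.0 * best_d:
                continue
            d = np.linalg.norm(x - centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, computed
```

The returned `computed` counter can be compared with `len(X) * len(centers)`, the cost of a brute-force search, to see how many distance evaluations the test avoids. The assignments themselves are the same as those of a brute-force nearest-center search, since a center is skipped only when it provably cannot be strictly closer.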

Authors: Zhang Shunlong (张顺龙), Ku Tao (库涛), Zhou Hao (周浩)

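Affiliations: Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Jihua Group Jilin Ruanxin Technology Co., Ltd. (吉化集团吉林市软信技术有限公司)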

Source: Application Research of Computers (《计算机应用研究》), CSCD, PKU Core (北大核心), 2016, no. 2, pp. 413-416 (4 pages)

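Funding: National Science and Technology Support Program (2012BAH15F05); Jilin Province Technology Innovation Fund for Science and Technology SMEs (12C26212201399); National Natural Science Foundation of China (612033161 51205389)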

Keywords: DIACK; fast K-means; clustering; triangle inequality

Classification code: TP301.6 [Automation and Computer Technology: Computer Architecture]
