
Improved Bisecting K-Means Algorithm Based on Hadoop (基于Hadoop的二分K均值改进算法)
Abstract: The traditional bisecting K-means algorithm chooses cluster centroids at random during each bisecting step. To obtain good centroids, the random selection has to be repeated many times, which is time-consuming. This paper instead selects the two points separated by the maximum distance as the initial centroids, which effectively reduces the time complexity, and uses point sampling to avoid the influence of outliers. In addition, because data volumes keep growing, the paper proposes a parallel bisecting K-means algorithm built on the Hadoop distributed platform. Experiments show that the parallel algorithm achieves a fairly good speedup.
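Source: Science Mosaic (《科技广场》), 2016, No. 9, pp. 4-8 (5 pages)
Funding: National Natural Science Foundation of China project "Research and Application of Gesture Interaction Technology Based on Depth Information and Saliency Computation" (No. 61363046); project author: Yang Wenji (杨文姬)
Keywords: bisecting K-means; optimization; parallelism; Hadoop; speedup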
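The centroid selection described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not say how points are sampled or which cluster is split next, so the following are assumptions rather than the paper's method: seeds are the farthest pair within a uniform random sample of the cluster, the cluster with the largest sum of squared errors (SSE) is bisected next, and the names `farthest_pair`, `bisect`, `sample_size`, and `iters` are hypothetical. The Hadoop parallelization is not shown.

```python
import math
import random


def farthest_pair(points):
    """Return the two points with the largest Euclidean distance (brute force)."""
    best_d, a, b = -1.0, None, None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d > best_d:
                best_d, a, b = d, points[i], points[j]
    return a, b


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def bisect(points, sample_size=50, iters=10, rng=None):
    """Split one cluster (at least two points) with 2-means seeded by a farthest pair.

    Assumption: the pair is searched within a random sample, so a lone
    outlier is unlikely to be chosen as a seed.
    """
    rng = rng or random.Random(0)
    if len(points) <= sample_size:
        sample = points
    else:
        sample = rng.sample(points, sample_size)
    c0, c1 = farthest_pair(sample)
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k):
    """Bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=sse)
        target = clusters.pop()
        if len(target) < 2:
            clusters.append(target)
            break  # nothing left to split
        left, right = bisect(target)
        if not left or not right:
            clusters.append(target)
            break  # the cluster could not be split
        clusters.extend([left, right])
    return clusters
```

Because the seeds are computed rather than drawn at random, each bisection runs once instead of being repeated over several random trials, which is the saving the abstract points to. The brute-force pair search is quadratic in the sample size, so the sample has to stay small for that saving to hold.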