
Parallel Implementation of an Improved K-means Algorithm Based on Spark (基于Spark的K-means改进算法的并行化实现)
Abstract: When processing massive data, the traditional K-means algorithm suffers from high computational complexity and insufficient computing capacity. To address this, the SKDk-means (Spark based kd-tree K-means) parallel clustering algorithm is proposed. The algorithm introduces a kd-tree to improve the selection of initial center points, which overcomes the tendency of traditional K-means to fall into a local optimum because of the uncertainty of its initial points. It also uses the nearest-neighbor search of the kd-tree to reduce the distance computations performed during K-means iterations, which speeds up clustering. The algorithm is parallelized on the Spark platform so that it is suitable for clustering massive data. Finally, experiments verify that the algorithm has good accuracy and parallel computing performance.
Authors: SONG Dong-Fei (宋董飞), XU Hua (徐华) (School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China)
Source: Computer Systems & Applications (《计算机系统应用》), 2018, No. 4, pp. 151-156 (6 pages)
Funding: Natural Science Foundation of Jiangsu Province (BK20140165); China Scholarship Council project (201308320030)
Keywords: kd-tree; Spark; K-means; parallelization; cloud computing
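
The abstract says that a kd-tree nearest-neighbor search is used to cut down the distance computations in each K-means iteration. The following is a minimal single-machine sketch of that idea in plain Python, not the authors' implementation. It assumes the kd-tree is built over the current cluster centers and queried once per data point. The abstract does not say which set is indexed, and the names `build_kdtree`, `nearest`, and `kmeans_kdtree` are hypothetical. The kd-tree-based choice of initial centers and the Spark parallelization are not reproduced here.

```python
import random

def build_kdtree(points, depth=0):
    """Build a kd-tree over (index, coords) pairs; returns nested tuples or None."""
    if not points:
        return None
    axis = depth % len(points[0][1])           # cycle through the dimensions
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2                     # median point becomes this node
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (squared_distance, index) of the stored point closest to query."""
    if node is None:
        return best
    (idx, coords), axis, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(coords, query))
    if best is None or d2 < best[0]:
        best = (d2, idx)
    diff = query[axis] - coords[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far subtree only if the splitting plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best

def kmeans_kdtree(data, centers, iterations=10):
    """Lloyd iterations whose assignment step queries a kd-tree of the centers.

    Assumption: the tree indexes the centers and is rebuilt every iteration.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        tree = build_kdtree(list(enumerate(centers)))
        sums = [[0.0] * len(data[0]) for _ in centers]
        counts = [0] * len(centers)
        for p in data:                         # assignment step
            _, k = nearest(tree, p)
            counts[k] += 1
            for j, v in enumerate(p):
                sums[k][j] += v
        # Update step: mean of assigned points; an empty cluster keeps its center.
        centers = [tuple(s / counts[k] for s in sums[k]) if counts[k] else centers[k]
                   for k, s in enumerate(sums)]
    return centers

# Synthetic demo data (an assumption for illustration): two Gaussian blobs.
random.seed(0)
data = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)] + \
       [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(50)]
final_centers = kmeans_kdtree(data, centers=[data[0], data[-1]])
print(final_centers)
```

In this sketch the per-point loop in the assignment step is independent across points, so it is the part one would expect to distribute across data partitions in a parallel setting. How the paper actually partitions the work on Spark is not stated in the abstract.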