
Research of improved parallel K-means algorithm based on Spark framework (cited by: 2)
Abstract: The traditional K-means algorithm selects its initial cluster centers at random, which makes the clustering results inaccurate. As the volume of network data surges, the running time of the traditional serial algorithm becomes too long. Some researchers have parallelized K-means on the Hadoop framework. Although this reduces the running time, K-means must iterate repeatedly when determining cluster assignments, and these iterations repeatedly read from and write to disk, so a large share of the time is spent on disk operations and the efficiency of the parallel algorithm suffers considerably. To address this, the paper proposes an improved parallel K-means algorithm based on the Spark framework, which avoids the frequent disk reads and writes by operating on RDDs. Comparative experiments on standard datasets verify the effectiveness of the improved algorithm in terms of clustering quality and parallel speedup.
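The abstract does not spell out the improved initialization or the exact RDD operations, so the following is only a minimal plain-Python sketch of the general pattern it refers to: one K-means iteration split into a per-partition "map" phase and a "reduce by key" phase, repeated over data that stays in memory. All names (`kmeans_step`, `kmeans`, `speedup`) are hypothetical, the initial centers are simply passed in rather than chosen by the paper's method, and the speedup is assumed to be the usual ratio of serial to parallel running time. In Spark, the two phases would roughly correspond to `map` and `reduceByKey` transformations on a cached RDD.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Tuple[List[float], int]  # (coordinate sums, point count)


def closest_center(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(partitions: List[List[Point]], centers: List[Point]) -> List[Point]:
    """One K-means iteration in map / reduce-by-key style (illustrative only)."""
    # "Map" phase: each partition builds local (sums, count) per cluster index.
    partials: List[Dict[int, Partial]] = []
    for part in partitions:
        local: Dict[int, Partial] = {}
        for point in part:
            k = closest_center(point, centers)
            sums, count = local.get(k, ([0.0] * len(point), 0))
            local[k] = ([s + x for s, x in zip(sums, point)], count + 1)
        partials.append(local)

    # "Reduce" phase: merge the per-partition results that share a cluster index.
    merged: Dict[int, Partial] = {}
    for local in partials:
        for k, (sums, count) in local.items():
            if k in merged:
                m_sums, m_count = merged[k]
                merged[k] = ([a + b for a, b in zip(m_sums, sums)], m_count + count)
            else:
                merged[k] = (sums, count)

    # Update: new center is the mean of its points; an empty cluster keeps its old center.
    return [
        tuple(s / merged[k][1] for s in merged[k][0]) if k in merged else centers[k]
        for k in range(len(centers))
    ]


def kmeans(partitions: List[List[Point]], centers: List[Point],
           max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Iterate until no center moves by more than `tol` (squared distance)."""
    for _ in range(max_iter):
        new_centers = kmeans_step(partitions, centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(new_centers, centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers


def speedup(serial_time: float, parallel_time: float) -> float:
    """Assumed definition: serial running time divided by parallel running time."""
    return serial_time / parallel_time
```

Because the partitions are ordinary in-memory lists, every iteration reuses them without touching disk, which is the property the abstract attributes to working with RDDs instead of Hadoop's per-iteration disk I/O.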
Authors: 邓青 (Deng Qing), 杨宁 (Yang Ning)
Source: 《智能计算机与应用》 (Intelligent Computer and Applications), 2018, No. 1, pp. 76-78 (3 pages)
Keywords: Spark; K-means; Map Reduce; Hadoop; speedup
