Abstract
The traditional K-means algorithm selects its initial cluster centers at random, which makes the clustering results inaccurate. As the volume of network data surges, the running time of the traditional serial algorithm becomes prohibitively long. Some researchers have parallelized K-means on the Hadoop framework. Although this shortens the running time, K-means must iterate repeatedly when deciding cluster membership, and each iteration reads from and writes to disk, so a large share of the time is spent on disk operations and the efficiency of the parallel algorithm is greatly reduced. To address this, this paper proposes an improved parallel K-means algorithm based on the Spark framework, which uses RDD operations to effectively avoid the frequent disk reads and writes. Comparative experiments on standard datasets verify the effectiveness of the improved algorithm in terms of clustering quality and parallel speedup.
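The iterative pattern the abstract refers to can be illustrated with a minimal sketch. It is written in plain Python rather than against the Spark API, and it is not the paper's improved algorithm: the abstract does not describe how the initial centers are chosen, so the caller passes them in explicitly as an assumption. The function names `assign` and `kmeans` are hypothetical. Each iteration consists of a map step (assign every point to its nearest center) and a reduce-by-key step (sum the points of each cluster), which is the shape that would correspond to `map` and `reduceByKey` on a cached RDD in Spark. The point is that the same `points` collection is reused in memory by every iteration instead of being reread from disk.

```python
from collections import defaultdict
from math import dist


def assign(point, centers):
    """Map step: return the index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Lloyd iterations written as a map step plus a reduce-by-key step.

    `centers` is supplied by the caller (assumption: the paper's own
    initialization scheme is not given in the abstract).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        sums = {}                  # cluster index -> coordinate-wise sum
        counts = defaultdict(int)  # cluster index -> number of points
        for p in points:           # `points` stays in memory across iterations
            i = assign(p, centers)
            acc = sums.get(i)
            sums[i] = tuple(p) if acc is None else tuple(a + b for a, b in zip(acc, p))
            counts[i] += 1
        # New center = mean of assigned points; an empty cluster keeps its old center.
        new_centers = [
            tuple(s / counts[i] for s in sums[i]) if i in sums else centers[i]
            for i in range(len(centers))
        ]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:           # stop once no center moves
            break
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(points, centers=[points[0], points[2]])
```

On this toy input the two centers settle at the means of the two obvious groups after the second pass over the data. In a Hadoop-style job each such pass would involve reading the input from disk and writing the intermediate centers back, which is the overhead the abstract attributes to the Hadoop approach.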
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2018, No. 1, pp. 76-78 (3 pages)