The traditional K-means algorithm selects its initial cluster centers at random, which makes the clustering result inaccurate. As the volume of network data surges, the running time of the traditional serial algorithm also becomes too long. Some researchers have used the Hadoop parallel framework to parallelize K-means. Although this improves the running time, the algorithm must iterate repeatedly until convergence, and each iteration involves disk reads and writes, so a large part of the time is spent on disk operations and the efficiency of the parallel algorithm is greatly reduced. To address this, this paper proposes an improved parallel K-means algorithm based on the Spark framework, which effectively addresses the frequent disk reads and writes by operating on RDDs. A comparison experiment is carried out on a standard dataset, and the efficiency of the improved algorithm is verified by its clustering quality and its parallel speedup.
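To make the iteration structure concrete, the following is a minimal pure-Python sketch of one K-means step written as a map phase followed by a reduce-by-key phase, which is the general shape such a job takes on Spark (`map`, `reduceByKey`, with the point set held in memory across iterations). It is an illustration under stated assumptions, not the paper's implementation: the function names `closest`, `kmeans_step`, `kmeans`, and `speedup` are hypothetical, the improved center initialization is not described in the abstract and is therefore not reproduced (initial centers are passed in explicitly), and speedup is taken in its usual sense of serial time divided by parallel time.

```python
def closest(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(points, centers):
    """One K-means iteration in map / reduce-by-key form (in-memory stand-in for an RDD job)."""
    # Map phase: emit (cluster index, (point, 1)) for every point.
    pairs = [(closest(p, centers), (p, 1)) for p in points]

    # Reduce-by-key phase: sum coordinates and counts per cluster index.
    sums = {}
    for key, (p, n) in pairs:
        if key in sums:
            s, m = sums[key]
            sums[key] = (tuple(a + b for a, b in zip(s, p)), m + n)
        else:
            sums[key] = (p, n)

    # New center = mean of assigned points; an empty cluster keeps its old center.
    return [
        tuple(x / sums[i][1] for x in sums[i][0]) if i in sums else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Iterate until no center moves by more than tol (squared distance) or max_iter is hit."""
    for _ in range(max_iter):
        new_centers = kmeans_step(points, centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers


def speedup(t_serial, t_parallel):
    """Parallel speedup: serial running time divided by parallel running time."""
    return t_serial / t_parallel


if __name__ == "__main__":
    # Assumed toy data: two well-separated groups in the plane.
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

Because `points` stays in memory for every call to `kmeans_step`, no intermediate result is written to disk between iterations; on Spark the analogous choice would be to cache the input RDD once and reuse it in each pass.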
Intelligent Computer and Applications