Abstract
To reduce the influence of uncertainty in the value of K and of randomly chosen initial cluster centers on clustering results, an improved K-means algorithm, CK-means+, is proposed, based on an optimized Canopy algorithm and a mean calculation method. The Canopy algorithm is optimized to reduce the influence of uncertainty in the distance threshold T on the final output value of K, and the value of K and the initial centers are obtained through the Canopy algorithm and the mean calculation method. Experiments on UCI datasets, parallelized with the Spark framework, show that CK-means+ is more efficient than the other algorithms compared and adapts better to large-scale data scenarios.
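The general idea of seeding K-means from Canopy output can be sketched as follows. This is only an illustration under stated assumptions: it uses a simplified, standard Canopy pass with two thresholds `t1 >= t2`, not the optimized Canopy variant of the paper, and it runs on a single machine with NumPy rather than on Spark. The function names `canopy_centers` and `kmeans`, the threshold values, and the toy data are all hypothetical.

```python
import numpy as np

def canopy_centers(X, t1, t2):
    """Simplified standard Canopy pass (assumption: not the paper's optimized variant).

    Returns one center per canopy, computed as the mean of the still-unassigned
    points lying within t1 of the canopy seed. The number of rows is K.
    """
    assert t1 >= t2 > 0
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        seed = X[remaining[0]]
        dists = np.linalg.norm(X[remaining] - seed, axis=1)
        # Loose threshold t1 decides canopy membership.
        members = [i for i, d in zip(remaining, dists) if d <= t1]
        centers.append(X[members].mean(axis=0))
        # Tight threshold t2 removes points from further seeding.
        remaining = [i for i, d in zip(remaining, dists) if d > t2]
    return np.array(centers)

def kmeans(X, centers, iters=20):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = centers.copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts):  # keep the old center if a cluster is empty
                centers[k] = pts.mean(axis=0)
    return centers, labels

# Toy data: three tight groups of four points each (hypothetical values).
offsets = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])
X = np.vstack([offsets + c for c in ([0.0, 0.0], [10.0, 10.0], [20.0, 0.0])])

init = canopy_centers(X, t1=3.0, t2=2.0)  # K is the number of canopies found
final, labels = kmeans(X, init)
```

Here K is never passed in explicitly; it falls out of the Canopy pass, which is why the choice of distance thresholds matters for the final K.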
Authors
邵金鑫
行艳妮
南方哲
赵鑫
马廷淮
钱育蓉
SHAO Jin-xin; XING Yan-ni; NAN Fang-zhe; ZHAO Xin; MA Ting-huai; QIAN Yu-rong (Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Software College, Xinjiang University, Urumqi 830046, China; School of International Education, Nanjing University of Information Science and Technology, Nanjing 210044, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2022, No. 5, pp. 1240-1248 (9 pages)
Computer Engineering and Design
Funding
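National Natural Science Foundation of China (61966035)
Innovation Team Fund Project of the Education Department of Xinjiang Uygur Autonomous Region (XJEDU2017T002)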