Abstract
Grid-based clustering is one of the more commonly used families of clustering algorithms. However, when applied to massive high-dimensional data, existing grid-based algorithms suffer from two problems: the accuracy of the clustering results is low, and the running time is long. To address these shortcomings, this paper builds on grid clustering and combines dimensionality reduction, adaptive grid partitioning, relative entropy, and distributed computing to propose an improved distributed clustering algorithm with adaptive grid partitioning, based on the Spark platform (AMCBS). Experiments show that the algorithm outperforms common clustering algorithms on the D31 benchmark dataset, UCI datasets, a face image dataset, and a GitHub text dataset, achieving both better accuracy and higher running efficiency.
Authors
蔡莉
王浩宇
周君
何婧
刘俊晖
CAI Li; WANG Hao-yu; ZHOU Jun; HE Jing; LIU Jun-hui (School of Software, Yunnan University, Kunming 650091, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2023, No. 4, pp. 731-736 (6 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61663047).