Abstract
Clustering is an unsupervised machine learning method whose task is to discover the natural clusters present in data. The shared nearest neighbor (SNN) clustering algorithm performs well on datasets with different sizes, shapes, and densities, but it has the following shortcomings: (1) its time complexity is O(n²), which makes it unsuitable for large datasets; (2) it offers no simple, practical guidance for choosing its parameter thresholds; (3) it can only handle datasets with numerical attributes. This paper improves the SNN algorithm so that it can handle mixed-attribute datasets, and gives a simple method for selecting the parameter thresholds. The running time of the improved algorithm is approximately linear in the size of the dataset, which makes it suitable for large, high-dimensional datasets. Experimental results on real and synthetic datasets show that the improved algorithm is effective and feasible.
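For readers unfamiliar with the shared nearest neighbor idea, the following is a minimal sketch of the commonly used SNN similarity (the number of k-nearest neighbors two points have in common), not the improved algorithm of this paper. The function names `knn_lists` and `snn_similarity`, the value of `k`, and the sample points are illustrative assumptions. The brute-force neighbor search also shows where the O(n²) cost mentioned in the abstract comes from.

```python
import math


def knn_lists(points, k):
    """Return the set of k nearest neighbors for every point.

    Brute force: every point is compared with every other point,
    so the number of distance evaluations grows as O(n^2).
    """
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def snn_similarity(neighbors, i, j):
    """Count the k-nearest neighbors shared by points i and j.

    Assumption: following the usual definition, the similarity is 0
    unless i and j appear in each other's neighbor lists.
    """
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])


# Two small, well-separated groups of points (illustrative data only).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
nbrs = knn_lists(points, k=3)
print(snn_similarity(nbrs, 0, 1))  # points in the same group
print(snn_similarity(nbrs, 0, 4))  # points in different groups
```

Points in the same group share neighbors and get a positive similarity, while points in different groups get zero; SNN-style clustering then thresholds these similarities, which is where the parameter-selection problem noted in the abstract arises.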
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2011, No. 8, pp. 138-142 (5 pages)
Funding
National Natural Science Foundation of China (No. 61070061)
Keywords
shared nearest neighbor clustering algorithm
one-pass clustering algorithm
large dataset