
Improved shared nearest neighbor clustering algorithm

Cited by: 3
Abstract: Clustering is an unsupervised machine learning method whose task is to discover the natural clusters present in data. The shared nearest neighbor (SNN) clustering algorithm performs well on datasets whose clusters differ in size, shape, and density, but it has the following shortcomings: (1) its time complexity is O(n²), which makes it unsuitable for large datasets; (2) it offers no simple, explicit guidance for choosing its parameter thresholds; (3) it can handle only datasets with numeric attributes. This paper improves the SNN algorithm so that it can process datasets with mixed attributes, and gives a simple method for selecting the parameter thresholds. The running time of the improved algorithm is approximately linear in the size of the dataset, so it is suitable for large, high-dimensional datasets. Experimental results on real and synthetic datasets show that the improved algorithm is effective and feasible.
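Authors: Li Xia (李霞), Jiang Shengyi (蒋盛益)
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 8, pp. 138-142 (5 pages)
Funding: National Natural Science Foundation of China (No. 61070061)
Keywords: shared nearest neighbor clustering algorithm; one-pass clustering algorithm; large dataset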
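
For readers unfamiliar with the shared nearest neighbor idea named in the abstract, the following is a minimal Python sketch of the commonly used SNN similarity from the SNN literature: two points are similar in proportion to how many of their k nearest neighbors they have in common, counted only when each point appears in the other's neighbor list. It is not the improved algorithm of this paper; the function names `knn_lists` and `snn_similarity`, the Euclidean distance, and the brute-force search are assumptions made for illustration. The brute-force search compares every pair of points, which is one way to see the quadratic cost the abstract mentions.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn_lists(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors.

    Brute force: every point is compared with every other point,
    so the number of distance evaluations grows as n^2.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors of i and j, or 0 if they are not
    in each other's k-nearest-neighbor lists."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0
```

A typical use would be to call `knn_lists(points, k)` once and then evaluate `snn_similarity` for pairs of interest; points in the same dense group tend to share several neighbors, while points in different groups usually score 0.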

