期刊文献+

一种基于动态近邻选择模型的聚类算法 被引量:18

A Clustering Algorithm Using Dynamic Nearest Neighbors Selection Model
下载PDF
导出
摘要 ROCK是Sudipno Guha等1999年提出的一个著名的面向分类属性数据的聚类算法,其突出贡献是采用公共近邻(链接)数的全局信息作为评价数据点间相关性的度量标准,而不是传统的基于两点间距离的局部度量函数.尽管ROCK在Mushroom等分类属性数据集上取得了很好的聚类结果,但该算法本身也存在一些缺陷和不足.首先,衡量两个数据点是否为邻居的相似度阈值θ需要预先静态指定,该阈值对聚类质量影响很大,在对数据集没有充分了解的前提下给出恰当的阈值是困难的.其次,在ROCK算法中,相似度函数sim仅被用于最初邻居的判断上,只考虑相似与否,而未考虑相似程度,使算法对θ值过于敏感.另外,ROCK还要求用户事先选定聚类簇数k.这些缺陷或者影响聚类效果,或使算法不便使用.该文深入分析了上述问题,并提出基于动态近邻选择模型的聚类算法DNNS,通过优选近邻来提高聚类质量.文中还定义了内聚度度量函数以指导聚类过程.对标准数据集VOTE和ZOO的实验结果表明,DNNS算法的fα指标优于ROCK和VBACC. ROCK, proposed by Sudipno Guha et al in 1999, is a well known, robust, categorical attribute oriented clustering algorithm. The main contribution of ROCK is the introduction of a novel concept called "common neighbors" (links) as similarity measure between a pair of data points. Compared with traditional distance-based approaches, links capture global information over the whole data set rather than local information between two data points. Despite its success in clustering some categorical databases such as Mushroom, there are still some underlying weaknesses. First, the user is required to select a similarity threshold θ, a value that can significantly influence final clustering results. Without sufficient prior-knowledge, it is difficult to make a proper choice of value θ. Second, similarity function sire is only used to judge neighbors and the degree of similarity is lost during the iterative process of clustering, making the algorithm sensitive to the value of θ. Third, the number of desired final clusters must be pre-specified, which is also difficult without fully understanding of the data set. These shortcomings either hinder the algorithm from achieving even better clustering result, or make the algorithm inconvenient to use. This paper investigates the above problems and proposes a novel algorithm named DNNS using Dynamic Nearest Neighbors Selection model, which improves clustering quality with an appropriate selection of nearest neighbors. A new cohesion measure also is discussed to control the clustering process. Experimental results on standard databases VOTE and ZOO demonstrate that DNNS outperforms ROCK and VBACC based on the evaluation metrics of fα.
作者 金阳 左万利
出处 《计算机学报》 EI CSCD 北大核心 2007年第5期756-762,共7页 Chinese Journal of Computers
基金 国家自然科学基金(60373099)资助.
关键词 数据挖掘 聚类 类别属性 data mining clustering categorical attributes
  • 相关文献

参考文献10

  • 1Dubes R C, Jain A K. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall, 1988.
  • 2Zhang Tian, Ramakrishnan Raghu, Livny Miron. Birch: An efficient data clustering method for very large databases// Proceedings of the ACM SIGMOD Conference on Management of Data. Montreal, Canada, 1996: 103-114.
  • 3Guha S, Rastogi R, Shim K. ROCK:A robust clustering algorithm for categorical attributes//Proceedings of the 15th International Conference on Data Engineering. Sydney, Australia, 1999:1-11.
  • 4Gupta G K, Ghosh J. Value balanced agglomerative connectivity clustering//Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery Ⅲ. Orlando, USA, 2001:6-15.
  • 5Dutta M, Kakoti Mahanta A, Pujari Arun K. QROCK: A quick version of the ROCK algorithm for clustering of categorical data. Pattern Recognition Letters, 2005, 26(15): 2364-2373.
  • 6Gehrke J. New research directions in KDD. Report on the SIGKDD 2001 Conference Panel, SIGKDD Explorations,2002, 3(2): 76-77.
  • 7Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. Minnesota: University of Minnesota, Technical Report: 00-034, 2002.
  • 8Sebastiani F. A tutorial on automatic text categorization// Proceedings of the 1st Argentinean Symposium on Artificial Intelligence (ASAI'99). Buenos Aires, AR, 1999:7-35.
  • 9Larsen B, Aone C. Fast and effective text mining using linear-time document clustering//Proceedings of the 5th ACM SIGKDD. San Diego, CA, 1999:16-22.
  • 10Kleinberg J, Papadimitriou C, Raghavan P. Segmentation problems//Proceedings of the 30th ACM Symposium on Theory of Computing. Duluth, MIN, USA, 1998:473-481.

同被引文献205

引证文献18

二级引证文献224

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部