An Improved CURE Clustering Algorithm (一种改进的CURE聚类算法)

An Improved Clustering Approach of CURE

Abstract: Cluster analysis is an important research direction in data mining. Many clustering algorithms have been designed for large databases, and CURE, a hierarchical method, is a typical representative. This paper improves CURE in three respects. First, the new method still represents each cluster by several points but drops the step of shrinking the representative points. Second, by analyzing the statistical characteristics of the nearest-neighbor distances (1-DIST) within a cluster, it proposes a criterion for automatically separating sub-clusters, so the number of clusters need not be specified in advance. Third, on top of CURE's random sampling and partitioned clustering of the original data, it adds a grid-partitioning step, which both dampens the influence of noise and shortens clustering time. Tests on two-dimensional data show that the improved CURE correctly identifies most clusters, including clusters of arbitrary shape, and runs faster than the original algorithm.
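The abstract names two mechanisms without giving their details: a grid step that suppresses noise, and a separation criterion derived from nearest-neighbor (1-DIST) statistics. The following Python sketch illustrates how such steps could fit together; it is not the authors' algorithm. The function names, the cell size, the minimum cell count, and the `mean + k * std` threshold rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def grid_filter(points, cell_size, min_count):
    """Keep only points lying in grid cells that hold at least min_count points.

    Assumption: sparse cells are treated as noise; the paper's actual rule is not
    stated in the abstract.
    """
    def cell_of(p):
        return (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))

    counts = Counter(cell_of(p) for p in points)
    return [p for p in points if counts[cell_of(p)] >= min_count]


def nearest_neighbor_distances(points):
    """Brute-force 1-DIST: distance from each point to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def split_threshold(points, k=3.0):
    """Hypothetical separation threshold: mean + k * std of the 1-DIST values."""
    d = nearest_neighbor_distances(points)
    mean = sum(d) / len(d)
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))
    return mean + k * std


def separate(points, threshold):
    """Group points into connected components, linking pairs within threshold."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= threshold]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        clusters.append([points[i] for i in component])
    return clusters


# Toy 2-D data: two 3x3 blocks of points plus one isolated noise point.
block_a = [(x, y) for x in range(3) for y in range(3)]
block_b = [(x + 20, y + 20) for x in range(3) for y in range(3)]
noise = (10, 10)

filtered = grid_filter(block_a + block_b + [noise], cell_size=5, min_count=2)
clusters = separate(filtered, split_threshold(filtered))
print(len(filtered), [len(c) for c in clusters])
```

In this toy run the number of clusters is never passed in: it falls out of the distance threshold, which is the property the abstract claims for the improved method. The brute-force distance computation is quadratic and only suitable for small samples.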

Authors: Guo Jun (郭俊), Fan Yanguo (樊彦国)

Affiliation: College of Resources and Information, University of Petroleum (East China)

Source: Inner Mongolia Petrochemical Industry (《内蒙古石油化工》), CAS, 2005, No. 4, pp. 14-17 (4 pages)

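Keywords: clustering algorithm; large-scale database; research direction; data mining; cluster analysis; statistical characteristics; automatic separation; random sampling; original data; noise influence; two-dimensional data; representative points; nearest distance; sub-cluster; grid; hierarchical clustering; representative objects; CURE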

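Classification codes: TP391.41 [Automation and Computer Technology: Computer Application Technology]; TP311.13 [Automation and Computer Technology: Computer Software and Theory]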
