An Improved Clustering Algorithm for Dynamic Data Streams Based on Principal Component Analysis and Density (cited by: 1)
Abstract: This paper studies data stream clustering under limited resources. Existing clustering methods struggle to cluster massive, high-speed data streams quickly and effectively within bounded memory and bounded time. To address this, the paper designs PDStream, a dynamic data stream clustering algorithm based on principal components and density. PDStream manages the data stream with a sliding window. First, a principal component model acts as the front-end system: it transforms the attributes of the source data within each basic window, which reduces dimensionality. Next, a density-based clustering model acts as the back-end system and performs the clustering. Finally, the summary data generated in the system undergoes a simplified secondary clustering, and the clusters are updated. Experiments show that PDStream effectively overcomes the shortcoming of the STREAM algorithm, whose clustering is dominated by historical data, and that it handles massive data well while producing high-quality clusters.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, Peking University Core Journal, 2010, No. 4, pp. 579-585 (7 pages)
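Funding: National Natural Science Foundation of China (No. 70671094); Key Project of the Zhejiang Provincial Natural Science Foundation (No. Z1091224); Zhejiang Provincial Natural Science Foundation (No. Y1090617); Zhejiang Provincial Science and Technology Plan Project (No. 2009C13G2050020)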
Keywords: data stream clustering; principal component analysis; density; sliding window
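
The pipeline described in the abstract (sliding window, principal component front end, density-based back end, secondary clustering of summaries) can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' PDStream implementation: every function name and parameter (`basic_size`, `window_len`, `eps`, `min_pts`) is hypothetical, the projection keeps only the first principal component, the density step is a simple one-dimensional gap-based grouping, and the secondary clustering is a greedy merge of centroids.

```python
import math
from collections import deque


def top_component(rows, iters=100):
    """Estimate the mean and first principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Asymmetric start vector; a start exactly orthogonal to the
    # principal direction would not converge (ignored in this sketch).
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # One multiplication by the unnormalised covariance: X^T (X v).
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def density_groups_1d(values, eps, min_pts):
    """Split sorted projections at gaps larger than eps; drop sparse groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups, current = [], [order[0]]
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] <= eps:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return [g for g in groups if len(g) >= min_pts]


def summarize_window(rows, eps, min_pts):
    """Front end (projection) plus back end (density grouping) for one basic window."""
    mean, v = top_component(rows)
    d = len(v)
    proj = [sum((r[j] - mean[j]) * v[j] for j in range(d)) for r in rows]
    summaries = []
    for group in density_groups_1d(proj, eps, min_pts):
        # Keep centroids in the original space so that summaries from
        # different basic windows remain comparable.
        centroid = [sum(rows[i][j] for i in group) / len(group) for j in range(d)]
        summaries.append((len(group), centroid))
    return summaries


def merge_summaries(summaries, eps):
    """Simplified secondary clustering: greedily merge nearby centroids."""
    clusters = []  # each entry is [count, centroid]
    for count, centroid in summaries:
        for c in clusters:
            if math.dist(c[1], centroid) <= eps:
                total = c[0] + count
                c[1] = [(c[0] * a + count * b) / total for a, b in zip(c[1], centroid)]
                c[0] = total
                break
        else:
            clusters.append([count, list(centroid)])
    return clusters


def pdstream_sketch(stream, basic_size=20, window_len=3, eps=1.0, min_pts=3):
    """Process a stream and return clusters over the most recent basic windows."""
    window = deque(maxlen=window_len)  # the oldest basic window is evicted automatically
    buffer, clusters = [], []
    for point in stream:
        buffer.append(point)
        if len(buffer) == basic_size:
            window.append(summarize_window(buffer, eps, min_pts))
            buffer = []
            clusters = merge_summaries([s for w in window for s in w], eps)
    return clusters
```

Calling `pdstream_sketch(points)` on an iterable of equal-length numeric tuples returns a list of `[count, centroid]` pairs. Because only the summaries of the last `window_len` basic windows are kept, older data stops influencing the result, which mirrors the abstract's point about not being dominated by historical data.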