An Improved Clustering Algorithm for Dynamic Data Streams Based on Principal Component Analysis and Density

Original Chinese title: 一种基于主成分和密度的改进型动态数据流聚类算法

Cited by: 1

Abstract: This paper studies data stream clustering under limited resources. Existing clustering methods struggle to cluster massive, high-speed data streams quickly and effectively within bounded memory and bounded time. To address this, the paper designs PDStream, a dynamic data stream clustering algorithm based on principal component analysis and density. PDStream manages the stream with a sliding window. First, a principal component model acts as a front-end stage that transforms the attributes of the source data in each basic window, which reduces its dimensionality. Next, a density-based clustering model acts as a back-end stage that performs the clustering. Finally, the summary data generated in the preceding steps undergo a simplified second round of clustering, and the clusters are updated. Experiments show that PDStream overcomes the STREAM algorithm's drawback of letting historical data dominate the clustering result, handles massive data well, and produces high-quality clusters.
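The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' PDStream implementation, whose details the abstract does not give. It assumes several simplifications: only the leading principal component is kept (found by power iteration), the density step is a plain DBSCAN-style procedure, summaries are stored as (centroid, count) pairs in the original attribute space, and the second-stage clustering is a greedy merge of nearby centroids. All names and parameters (`SlidingWindowClusterer`, `basic_size`, `num_basic`, `eps`, `min_pts`, `radius`) are hypothetical.

```python
from collections import deque
import math


def top_component(rows, iters=100):
    """Return (mean, unit vector) of the leading principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d); acceptable for the small d of a sketch.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # asymmetric start vector (assumption)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return mean, v


def density_cluster(points, eps, min_pts):
    """DBSCAN-style labels: -1 marks noise, 0..k-1 mark clusters."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            elif labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
    return labels


def summarize_window(rows, eps, min_pts):
    """Cluster one basic window in the reduced space; return (centroid, count) summaries."""
    mean, v = top_component(rows)
    projected = [[sum((r[j] - mean[j]) * v[j] for j in range(len(v)))] for r in rows]
    labels = density_cluster(projected, eps, min_pts)
    groups = {}
    for r, lab in zip(rows, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(r)
    return [([sum(col) / len(g) for col in zip(*g)], len(g)) for g in groups.values()]


def merge_summaries(summaries, radius):
    """Simplified second-stage clustering: greedily merge summaries with nearby centroids."""
    clusters = []
    for centroid, count in summaries:
        for c in clusters:
            if math.dist(c[0], centroid) <= radius:
                total = c[1] + count
                c[0] = [(a * c[1] + b * count) / total for a, b in zip(c[0], centroid)]
                c[1] = total
                break
        else:
            clusters.append([list(centroid), count])
    return clusters


class SlidingWindowClusterer:
    def __init__(self, basic_size, num_basic, eps, min_pts, radius):
        self.basic_size = basic_size
        self.buffer = []
        # Summaries of the oldest basic window are dropped automatically.
        self.window = deque(maxlen=num_basic)
        self.eps, self.min_pts, self.radius = eps, min_pts, radius

    def add(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.basic_size:
            self.window.append(summarize_window(self.buffer, self.eps, self.min_pts))
            self.buffer = []

    def clusters(self):
        flat = [s for summaries in self.window for s in summaries]
        return merge_summaries(flat, self.radius)
```

Because only per-window summaries are retained and the `deque` discards the oldest basic window, memory stays bounded and expired data stops influencing the result, which mirrors the abstract's point about not being controlled by historical data. Projecting onto a single component is a deliberate simplification; a fuller version would keep enough components to preserve cluster separation.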

Authors: 琚春华 (Ju Chunhua), 梅铮 (Mei Zheng), 许翀寰 (Xu Chonghuan)

Affiliation: College of Computer and Information Engineering, Zhejiang Gongshang University

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2010, No. 4, pp. 579-585 (7 pages)

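Funding: National Natural Science Foundation of China (No. 70671094); Key Project of the Zhejiang Provincial Natural Science Foundation (No. Z1091224); Zhejiang Provincial Natural Science Foundation (No. Y1090617); Zhejiang Provincial Science and Technology Plan Project (No. 2009C13G2050020)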

Keywords: data stream; clustering; principal component analysis; density; sliding window

Classification code: TP391.41 [Automation and Computer Technology: Computer Application Technology]



