期刊文献+

基于概率数据流的有效聚类算法 被引量:15

Effective Clustering Algorithm for Probabilistic Data Stream
下载PDF
导出
摘要 提出一种在概率数据流上进行聚类的有效方法P-Stream.P-Stream针对数据流上的概率元组提出强簇、过渡簇和弱簇的概念,设计一种有效的在线候选簇选择策略,为每个不断到达的数据元组合理地找到可能归属的簇,并在每个检查点存储微簇快照,以便离线进一步高层聚类和演化分析.最后设计一个"积极"的二层聚类模型来判断现有的第1层聚类模型是否还适应数据流中最近到达的概率元组.实验采用KDD-CUP’98和KDD-CUP’99真实数据集以及变换高斯分布的人工数据集构造概率数据流.实验结果表明,P-Stream具有良好的聚类质量、较快的处理速度,能够有效地适应数据演化情况. An effective clustering algorithm called "P-Stream" for probabilistic data stream is developed in this paper for the first time. For the uncertain tuples in the data stream, the concepts of strong cluster, transitional clusters and weak cluster are proposed in the P-Stream. With these concepts, an effective strategy of choosing candidate cluster is designed, which can find the sound cluster for every continuously arriving data point. Then, in order to further cluster on the high level and analyze the evolving behaviors of data streams, snapshots ot micro-clusters are stored at every checkpoint. At last, an "aggressive" two-tier clustering model is introduced to judge whether the most recently arrived data point is fitting in with the first level clustering model or not. Probabilistic data streams in the experiments include KDD-CUP'98 and KDD-CUP'99 real data sets and synthetic data sets with changing Gaussian distributions. Comprehensive experimental results demonstrate that P-Stream is ot high quality, fast processing rate and is efficiently fitting in with the evolving situations of data streams.
出处 《软件学报》 EI CSCD 北大核心 2009年第5期1313-1328,共16页 Journal of Software
基金 国家重点基础研究发展计划(973)No.2005CB321905~~
关键词 概率数据流 聚类 演化分析 probabilistic data stream clustering evolving analysis
  • 相关文献

参考文献22

  • 1Cormode G, Garofalakis M. Sketching probabilistic data streams. In: Chan CY, Ooi BC, Zhou A, eds. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. Beijing: ACM Press, 2007. 281-292.
  • 2Jayram TS, McGregor A, Muthukrishan, Vee E. Estimating statistical aggregates on probabilistic data streams. In: Libkin L, ed. Proc. of the 26th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems. Beijing: ACM Press, 2007. 243-252.
  • 3Jayram TS, Kale S, Vee E. Efficient aggregation algorithms for probabilistic data. In: Bansal N, Pruhs K, Stein C, eds. Proc. of the 18th Annual ACM-SIAM Syrup. on Discrete Algorithms (SODA). New Orleans: SIAM, 2007. 346-355.
  • 4Aggarwal CC, Han J, Yu PS. A framework for clustering evolving data streams. In: Freytag JC, Lockmann PC, Abiteboul S, Carey MJ, Seling PG, Heuer A, eds. Proc. of the Int'l Conf. on Very Large Data Bases. Berlin: Morgan Kaufmann Publishers, 2003. 81-92.
  • 5Dalvi N, Suciu D. Efficient query evaluation on probabilistic databases. In: Nascimento MA, Ozsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB, eds. Proe. of the VLDB. Toronto: Morgan Kaufmarm Publishers, 2004. 864-875.
  • 6Burdick D, Deshpande PM, Jayram TS, Ramakrishnan R, Vaithyanathan S. OLAP over uncertain and imprecise data. In: Bohm K, Jensen CS, Haas LM, Kersten ML, Larson P, Ooi BC, eds. Proc. of the Int'l Conf. on Very Large Data Bases. Trondheim: ACM Press, 2005.970-981.
  • 7Sarma AD, Benjelloum O, Halevy A, Widom J. Working models for uncertain data. In: Liu L, Reuter A, Whang KY, Zhang J, eds. Proc. of the 22nd Int'l Conf. on Data Engineering. Atlanta: IEEE Computer Society, 2006.
  • 8Cheng R, Kalashnikov D, Prabhakar S. Querying imprecise data in moving object environments. IEEE Trans. on Knowledge and Data Engineering, 2004,16(9):1112-1127.
  • 9Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY. Efficient clustering of uncertain data. In: Cliton CW, Zhong M, Liu JM, Wah BW, Wu XD, eds. Proc. of the 6th IEEE Int'l Conf. on Data Mining. Hong Kong: IEEE Computer Society, 2006. 436-445.
  • 10Guha S, Mishra N, Motwani R, Callaghan LO. Clustering data streams. In: Yong DC, ed. Proe. of the 41st Annual Symp. on Foundations of Computer Science. Redondo Beach: IEEE Computer Society, 2000. 359-366.

同被引文献98

引证文献15

二级引证文献50

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部