Abstract
This paper addresses clustering over sliding windows. It introduces two types of exponential histogram of cluster features, false positive and false negative, which support cluster analysis over windows with false positive error and false negative error, respectively. Building on these structures, it proposes a sliding-window clustering algorithm for data streams. Using memory that is sublinear in the window size, the algorithm keeps an up-to-date summary of the distribution of recent records and can therefore cluster the data within the sliding window. It can also be extended to clustering over the N-n window, an extended model of the sliding window. The experiments construct evolving data streams from the KDD-CUP'99 and KDD-CUP'98 real data sets and from synthetic data sets with changing Gaussian distributions. Theoretical analysis and experimental results show that the proposed method achieves good clustering quality, low memory overhead, and fast data processing.
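The abstract does not spell out how the histograms are built, so the following is only a minimal sketch of the general idea, assuming the classic exponential histogram scheme applied to cluster features (count, linear sum, squared sum, newest timestamp). The names `CF`, `EHCF`, `per_size`, `summary`, and `centroid` are hypothetical, and the sketch does not reproduce the paper's false positive and false negative variants or its error bounds.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional


@dataclass
class CF:
    """Temporal cluster feature (assumed layout): count, linear sum,
    squared sum, and timestamp of the newest record it summarizes."""
    n: int
    ls: List[float]
    ss: List[float]
    t: int

    def merge(self, other: "CF") -> "CF":
        # Cluster features are additive, so merging is component-wise.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  [a + b for a, b in zip(self.ss, other.ss)],
                  max(self.t, other.t))


class EHCF:
    """Exponential histogram of cluster features over the last `window` ticks."""

    def __init__(self, window: int, per_size: int = 2):
        self.window = window
        self.per_size = per_size      # max buckets allowed per bucket size
        self.buckets: List[CF] = []   # ordered oldest first

    def insert(self, x: List[float], t: int) -> None:
        self.buckets.append(CF(1, list(x), [v * v for v in x], t))
        # Drop buckets whose newest record has left the window.
        while self.buckets and self.buckets[0].t <= t - self.window:
            self.buckets.pop(0)
        # Cascade: when too many buckets share a size, merge the two oldest.
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b.n == size]
            if len(idx) <= self.per_size:
                break
            i, j = idx[0], idx[1]     # equal-sized buckets are adjacent
            self.buckets[i:j + 1] = [self.buckets[i].merge(self.buckets[j])]
            size *= 2

    def summary(self) -> Optional[CF]:
        # One CF approximating all records currently in the window.
        return reduce(CF.merge, self.buckets) if self.buckets else None

    def centroid(self) -> Optional[List[float]]:
        s = self.summary()
        return [v / s.n for v in s.ls] if s else None
```

For example, inserting eight one-dimensional records at timestamps 1 to 8 with a large window leaves four buckets holding 4, 2, 1, and 1 records. The number of buckets grows roughly logarithmically with the number of records summarized, which is what keeps memory sublinear in the window size. The oldest bucket may straddle the window boundary, so the summary is approximate in general.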
Source
Journal of Software (《软件学报》)
EI
CSCD
PKU Core Journal (北大核心)
2007, No. 4, pp. 905-918 (14 pages)
Funding
Supported by the National Natural Science Foundation of China under Grant Nos. 60496325, 60496327
Keywords
evolving data stream
clustering
sliding window