Abstract
This paper studies data stream clustering under limited resource constraints. Existing clustering methods struggle to cluster massive, high-speed data streams quickly and effectively within bounded memory and bounded time. To address this, the paper designs PDStream, a dynamic data stream clustering algorithm based on principal components and density. PDStream manages the data stream with a sliding window. First, a principal component model serves as the front-end system: it transforms the attributes of the source data in each basic window, which reduces dimensionality. Next, a density-based clustering model serves as the back-end system and performs the clustering. Finally, a simplified second-pass clustering is applied to the summary data generated by the system, and the clusters are updated. Experiments show that PDStream effectively overcomes the shortcoming of the STREAM algorithm, whose clustering is dominated by historical data, and that it handles massive data well and produces high-quality clusters.
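The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' PDStream implementation, which the abstract does not specify. It assumes that each basic window is projected onto its first principal component only (found here by power iteration), that density clustering is replaced by a simple gap-based grouping of the projected values, and that summary data are (centroid, count) pairs kept in the original attribute space. The function names and the parameters `eps`, `min_pts`, and `merge_dist` are hypothetical.

```python
import math
import random
from collections import deque


def top_component(rows, iters=50, seed=0):
    """Return (mean, direction) of the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    rng = random.Random(seed)  # seeded random start vector, so runs are repeatable
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # all points identical: no direction of variance
            break
        v = [x / norm for x in w]
    return mean, v


def cluster_window(rows, eps, min_pts):
    """Front end: project one non-empty basic window onto its first component.
    Back end: group projected values whose neighbouring gap is at most eps
    (a simplified stand-in for density clustering); small groups are noise."""
    mean, v = top_component(rows)
    d = len(v)
    proj = sorted((sum((r[j] - mean[j]) * v[j] for j in range(d)), i)
                  for i, r in enumerate(rows))
    groups, current = [], [proj[0][1]]
    for (prev, _), (val, idx) in zip(proj, proj[1:]):
        if val - prev <= eps:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    summaries = []
    for g in groups:
        if len(g) >= min_pts:
            centroid = [sum(rows[i][j] for i in g) / len(g) for j in range(d)]
            summaries.append((centroid, len(g)))  # summary data: (centroid, count)
    return summaries


def merge_summaries(window, merge_dist):
    """Second-pass clustering: merge summaries whose centroids lie within merge_dist."""
    merged = []
    for summaries in window:
        for centroid, count in summaries:
            for k, (c, n) in enumerate(merged):
                if math.dist(c, centroid) <= merge_dist:
                    total = n + count
                    merged[k] = ([(a * n + b * count) / total
                                  for a, b in zip(c, centroid)], total)
                    break
            else:
                merged.append((centroid, count))
    return merged


# Toy stream of three basic windows; the sliding window keeps only the newest two.
w1 = [[0, 0], [0.1, 0.1], [0.2, 0.2], [5, 5], [5.1, 5.1], [5.2, 5.2]]
w2 = [[x + 0.05, y + 0.05] for x, y in w1]
w3 = [[0, 0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

window = deque(maxlen=2)
for batch in (w1, w2, w3):
    window.append(cluster_window(batch, eps=0.5, min_pts=2))
merged = merge_summaries(window, merge_dist=1.0)
print(merged)
```

Because the `deque` is created with `maxlen=2`, the summaries of the oldest basic window are evicted automatically when the third one arrives, so only recent data influence the merged clusters. Storing centroids in the original attribute space is a design choice of this sketch: each basic window fits its own projection, so projected coordinates from different windows would not be directly comparable.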
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
Peking University Core Journals (北大核心)
2010, No. 4, pp. 579-585 (7 pages)
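Funding
National Natural Science Foundation of China (No. 70671094)
Key Project of the Natural Science Foundation of Zhejiang Province (No. Z1091224)
Natural Science Foundation of Zhejiang Province (No. Y1090617)
Science and Technology Program of Zhejiang Province (No. 2009C13G2050020)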
Keywords
data stream clustering
principal component analysis
density
sliding window