Abstract
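To address overlapping and incomplete data in distributed data streams, and the requirement that clustering incur a low communication cost, this paper proposes DAM-Distream, a distributed data stream clustering algorithm that combines density-based and model-based clustering. The algorithm describes the distribution of each data stream with a Gaussian mixture model, which compresses the data volume effectively and captures the overlap among distributed streams well. Because the EM algorithm used to estimate the model parameters is sensitive to its initial values, Hoeffding bound theory and a density-based algorithm are applied to pre-cluster the data stream and obtain fairly accurate initial parameters. Finally, a strategy of merging approximate models yields the global model. Simulation results show that DAM-Distream effectively overcomes this weakness of the EM algorithm, obtains better model parameters, and improves clustering quality for data streams in a distributed environment while reducing the system's communication cost.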
To handle overlapping and missing data in distributed data streams while keeping communication costs low, DAM-Distream, a clustering algorithm that combines a density-based method with a model-based method, is proposed. The algorithm uses a Gaussian mixture model (GMM) to describe the data streams flowing into the local distributed sites. However, the GMM parameters are obtained by the EM algorithm, which is sensitive to its initial values. DAM-Distream therefore first clusters the data streams with a density-based algorithm to find suitable initial parameters for the Gaussian mixture model. Second, the EM algorithm is used for iterative clustering to determine the model. Finally, the models are uploaded to the central site for integration. Experimental results show that DAM-Distream effectively overcomes the shortcomings of the EM algorithm and obtains better GMM parameters. The experiments also show that it improves the clustering quality of data streams in distributed systems and reduces the communication cost of the system.
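The pipeline outlined in the abstract (density-based pre-clustering, EM refinement of a Gaussian mixture, merging of similar models) can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation. All function names are hypothetical, a simple gap-based grouping stands in for the density-based step, moment matching stands in for the model-merging strategy, and the standard Hoeffding bound is shown only as a formula because the abstract does not say how the paper applies it.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: with probability 1 - delta, the mean of n
    samples with range value_range is within this distance of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def density_init(data, eps, min_pts):
    """Assumed stand-in for density-based pre-clustering: sort the points,
    split wherever the gap exceeds eps, keep groups with >= min_pts points."""
    pts = sorted(data)
    groups, current = [], [pts[0]]
    for x in pts[1:]:
        if x - current[-1] <= eps:
            current.append(x)
        else:
            groups.append(current)
            current = [x]
    groups.append(current)
    dense = [g for g in groups if len(g) >= min_pts]
    total = sum(len(g) for g in dense)
    weights = [len(g) / total for g in dense]
    means = [sum(g) / len(g) for g in dense]
    variances = [max(sum((x - m) ** 2 for x in g) / len(g), 1e-6)
                 for g, m in zip(dense, means)]
    return weights, means, variances

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_1d(data, weights, means, variances, iterations=30):
    """Plain EM for a 1-D Gaussian mixture, started from the given parameters."""
    weights, means, variances = list(weights), list(means), list(variances)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * gauss_pdf(x, means[j], variances[j]) for j in range(k)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)
    return weights, means, variances

def merge_components(w1, m1, v1, w2, m2, v2):
    """Assumed merging rule: moment matching of two weighted Gaussians."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return w, m, v

# Deterministic toy data for one local site: two groups around 0 and 10.
site_data = ([i * 0.05 for i in range(-40, 41)]
             + [10 + i * 0.05 for i in range(-40, 41)])
init = density_init(site_data, eps=0.5, min_pts=10)
weights, means, variances = em_1d(site_data, *init)
```

In a distributed setting, each site would send only its `(weights, means, variances)` to the central site instead of raw points, and the central site could call `merge_components` on components from different sites that it judges to be close. How closeness is judged is left open here.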
Source
《计算机工程与设计》
CSCD
Peking University Core Journal
2011, No. 8, pp. 2708-2711, 2763 (5 pages)
Computer Engineering and Design
Funding
National High Technology Research and Development Program of China (863 Program), grant 2008AA011001
Keywords
distributed data streams
clustering
density-based
model-based
data mining