Clustering algorithm over distributed data streams (分布式数据流聚类算法)

Cited by: 2

Abstract: Data in distributed data streams can overlap or be incomplete, and clustering them must keep communication cost low. To address this, the paper proposes DAM-Distream, a distributed data stream clustering algorithm that combines density-based and model-based clustering. The algorithm uses a Gaussian mixture model (GMM) to describe the distribution of the data streams flowing into each local site, which compresses the data effectively and reflects the overlap among the distributed streams well. Because the EM algorithm used to estimate the model parameters is sensitive to its initial values, DAM-Distream first applies Hoeffding bound theory and a density-based algorithm to pre-cluster the stream and obtain reasonably accurate initial parameters. The EM algorithm then refines the clustering iteratively. Finally, the local models are uploaded to the central site, where a strategy of merging approximate models yields the global model. Simulation results show that DAM-Distream overcomes the shortcomings of the EM algorithm and obtains better GMM parameters. It also improves the clustering quality of data streams in a distributed environment while reducing the communication cost of the system.
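The pipeline summarized in the abstract (density-based pre-clustering, EM refinement at each local site, model merging at the central site) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the data are one-dimensional, `density_init` is a simple gap-based stand-in for the density-based algorithm, `merge_models` merges nearby components by moment matching, and all function names, parameters, and thresholds are hypothetical. The abstract does not say how the Hoeffding bound is applied, so `hoeffding_epsilon` only states the standard formula and is not wired into the pipeline.

```python
import math
import random


def hoeffding_epsilon(value_range, delta, n):
    """Standard Hoeffding bound: with probability 1 - delta, the mean of n
    samples spanning value_range lies within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def density_init(points, radius, min_pts):
    """Gap-based 1-D density grouping (a simplified stand-in): sorted
    neighbours within radius share a group; groups under min_pts are noise.
    Returns initial (weight, mean, variance) triples for EM."""
    groups, current = [], []
    for x in sorted(points):
        if current and x - current[-1] > radius:
            groups.append(current)
            current = []
        current.append(x)
    groups.append(current)
    groups = [g for g in groups if len(g) >= min_pts]
    total = sum(len(g) for g in groups)
    params = []
    for g in groups:
        mean = sum(g) / len(g)
        var = sum((x - mean) ** 2 for x in g) / len(g)
        params.append((len(g) / total, mean, max(var, 1e-6)))
    return params


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(points, params, iters=30):
    """Plain EM for a 1-D Gaussian mixture, started from the given params."""
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in points:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        updated = []
        for k in range(len(params)):
            nk = sum(r[k] for r in resp) or 1e-300
            mean = sum(r[k] * x for r, x in zip(resp, points)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, points)) / nk
            updated.append((nk / len(points), mean, max(var, 1e-6)))
        params = updated
    return params


def merge_models(site_models, threshold):
    """Central site: pool all local components (weighted by site size) and
    merge neighbours whose means differ by at most threshold."""
    pooled = sorted(
        ((n * w, m, v) for n, params in site_models for w, m, v in params),
        key=lambda c: c[1],
    )
    merged = []
    for w, m, v in pooled:
        if merged and m - merged[-1][1] <= threshold:
            w0, m0, v0 = merged[-1]
            wt = w0 + w
            mt = (w0 * m0 + w * m) / wt
            vt = (w0 * (v0 + (m0 - mt) ** 2) + w * (v + (m - mt) ** 2)) / wt
            merged[-1] = (wt, mt, vt)
        else:
            merged.append((w, m, v))
    total = sum(w for w, _, _ in merged)
    return [(w / total, m, v) for w, m, v in merged]


# Two hypothetical local sites whose streams overlap in the cluster near 10.
rng = random.Random(0)
site_a = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(10.0, 1.0) for _ in range(300)]
site_b = [rng.gauss(10.0, 1.0) for _ in range(300)] + [rng.gauss(20.0, 1.0) for _ in range(300)]

local_models = []
for data in (site_a, site_b):
    init = density_init(data, radius=1.5, min_pts=5)
    local_models.append((len(data), em_gmm_1d(data, init)))

global_model = merge_models(local_models, threshold=1.0)
```

In this sketch each site sends only a few (weight, mean, variance) triples to the central site instead of its raw points, which is the sense in which a mixture model compresses the stream and lowers communication cost. The cluster shared by both sites is represented once in the merged model.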

Authors: Liu Lixiong (刘力雄), Guo Yunfei (郭云飞), Kang Jing (康晶), Ma Hong (马宏)

Affiliation: National Digital Switching System Engineering and Technological Research Center (国家数字交换系统工程技术研究中心)

Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2011, No. 8, pp. 2708-2711, 2763 (5 pages in total)

Funding: National High Technology Research and Development Program of China (863 Program), grant 2008AA011001

Keywords: distributed data streams; clustering; density-based; model-based; data mining

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
