Clustering Models and Algorithms for Distributed Data Streams Based on Data Synopsis

Cited by: 4

Abstract: Data stream mining is an effective way to discover knowledge from high-volume streaming data and has been widely studied. A typical example of a data stream is the data collected by sensors. As sensor networks become widespread, however, such streams are often collected and managed in a distributed fashion, which creates a need to mine data streams in a distributed way. The main obstacle in distributed data stream mining is the loss of mining quality or efficiency caused by distribution. To support clustering over distributed data streams, this paper discusses a mining model for distributed data streams and, based on that model, designs a corresponding synopsis data structure and the key mining algorithms, together with a theoretical evaluation or experimental validation of the algorithms. Experiments show that the proposed model and algorithms effectively reduce data communication cost while maintaining high clustering quality for the global pattern.

Authors: Mao Guojun, Cao Yongcun

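Affiliations: School of Information, Central University of Finance and Economics, Beijing; School of Information Engineering, Minzu University of China, Beijing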

Source: Computer Science (《计算机科学》), CSCD, PKU Core; 2013, No. 6, pp. 187-191, 202 (6 pages)

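Funding: National Natural Science Foundation of China (62173293); Central University of Finance and Economics Teaching Reform Project Fund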

Keywords: distributed data stream, data synopsis, incremental clustering, global pattern

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
