Abstract
As data collection and generation technologies mature, a growing number of applications produce data streams. In recent years, with the further spread of network applications, the focus has shifted from single data streams to multi-node distributed data streams, for example in sensor networks, network monitoring, web logs, and multi-site credit card transaction data. Such data are not only real-time, continuous, and large in scale, but also distributed, so managing and analyzing large-scale, distributed, dynamic datasets is an important problem facing researchers. In response, this paper gives formal descriptions of homogeneous and heterogeneous distributed data streams, analyzes the advantages and disadvantages of centralized and distributed stream processing architectures, discusses recent progress in distributed data stream classification algorithms, and summarizes the problems and challenges facing distributed data stream mining together with possible directions for future research.
Source
Journal of North China Institute of Science and Technology
2015, Issue 4, pp. 119-124 (6 pages)
Funding
Fundamental Research Funds for the Central Universities (3142014096, 3142014087, 3142014125, 3142013098)
Keywords
Distributed data streams
Data mining
Classification