摘要
基于计数的频繁项挖掘算法是目前数据流频繁项挖掘领域非常活跃的一种方法.在SS计数算法的启发下,针对网络流的实际特性,提出了一种剪枝操作受时间和流长双重约束的网络流频繁项挖掘算法TSFIM.算法采用三级缓存结构分别实现长流及时保护、基于时间的报文归并和基于流长的流项区分淘汰;通过理论分析了TSFIM算法的性能并探讨了算法适用于长时间情况下的约束条件和优势;最后通过实际流量数据测试表明,TSFIM算法具有非常高的空间利用率,算法在流频繁项提取、流长统计效果上明显优于SS等算法.
The counting method is an important frequent items mining algorithm in data streams. Inspired by the SS counting method and based on the property of flows, a frequent items mining algorithm over network flows TSFIM was proposed, whose pruning strategy was subject to the constraints of time and flow length. The algorithm TSFIM adopted three-stage buffers to fulfill the functions of preserving large flows promptly, packets aggregation based on time and flow deletion based on length respectively. The performance of TSFIM was analyzed in theory, then the constraint and advantage of TSFIM at the long time situation were discussed. Finally, an experiment on real network traffic shows that TSFIM is very space-saving and much more accurate than other algorithms like SS in the metrics of frequent items mining and flow length counting.
基金
陕西省自然科学基金重点项目(2012JZ8005)资助
关键词
网络流
频繁项挖掘
计数算法
剪枝操作
network flows
frequent items mining
counting method
pruning strategy