Abstract
Anomaly detection has been an active research topic in data mining in recent years. Isolation Forest is an efficient unsupervised anomaly detection algorithm that handles high-dimensional, large-scale data well. However, it has two shortcomings. First, it computes the anomaly score of a test sample from the sample's average path length across the forest, which ignores differences in anomaly-detection ability among individual isolation trees. Second, building a large number of isolation trees on large-scale data consumes substantial memory and time. To address both shortcomings, a parallelized improved Isolation Forest algorithm is proposed. Each isolation tree is weighted by the standard deviation of its path lengths when the anomaly score is computed, and the algorithm is parallelized on the Spark platform. Comparison experiments on public datasets, together with parallel-performance experiments under multiple parameter configurations, show that the proposed algorithm improves the accuracy of anomaly detection, achieves good parallel performance, and can efficiently process large-scale datasets that require a large number of isolation trees.
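The weighting idea can be illustrated with a minimal single-machine sketch. The abstract does not give the exact weighting formula, so everything here is an assumption: each tree's weight is taken to be the standard deviation of the path lengths of its own training subsample, normalized so that the weights sum to 1, and the weighted path length replaces the plain average in the usual Isolation Forest score. The function names (`build_forest`, `tree_weights`, `weighted_score`) are hypothetical, and the Spark parallelization is not shown.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf, plus the usual c(size) adjustment."""
    if "size" in tree:
        return depth + c(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)


def build_forest(data, n_trees, sample_size, seed=0):
    """Return (tree, sigma) pairs; sigma is the std of the tree's path lengths."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(sample_size))
    forest = []
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size)
        tree = build_tree(sample, 0, max_depth, rng)
        lengths = [path_length(tree, p) for p in sample]
        forest.append((tree, statistics.pstdev(lengths)))
    return forest


def tree_weights(forest):
    """Assumed weighting: sigma normalized to sum to 1 (uniform if all zero)."""
    sigmas = [sigma for _, sigma in forest]
    total = sum(sigmas)
    if total == 0:
        return [1.0 / len(forest)] * len(forest)
    return [sigma / total for sigma in sigmas]


def weighted_score(forest, x, sample_size):
    """Isolation Forest score using a weighted instead of a plain mean path."""
    weights = tree_weights(forest)
    h = sum(w * path_length(tree, x) for w, (tree, _) in zip(weights, forest))
    return 2.0 ** (-h / c(sample_size))


# Toy data: a Gaussian cluster plus one distant point.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = inliers + [outlier]

forest = build_forest(data, n_trees=50, sample_size=64)
outlier_score = weighted_score(forest, outlier, 64)
inlier_scores = [weighted_score(forest, p, 64) for p in inliers]
```

Because each tree is built and measured independently of the others, the loop in `build_forest` is the natural unit to distribute across workers, which is presumably where a Spark implementation would parallelize; the sketch keeps it sequential for clarity.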
Authors
WANG Cheng (王诚)
DI Xuan (狄萱)
School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Source
Computer Technology and Development (《计算机技术与发展》)
2021, Issue 6, pp. 13-18 (6 pages)
Funding
Natural Science Foundation of Jiangsu Province (BK20141428).