摘要
针对扩展隔离林(EIF)算法时间开销过大的问题,提出了一种基于随机子空间的扩展隔离林(RS-EIF)算法。首先,在原数据空间确定多个随机子空间;然后,在不同的随机子空间中通过计算每个节点的截距向量与斜率来构建扩展孤立树,并将多棵扩展孤立树集成为子空间扩展隔离林;最后,通过计算数据点在扩展隔离林中的平均遍历深度来确定数据点是否异常。在离群值检测数据库(ODDS)中的9个真实数据集与呈多元分布的7个人工数据集上的实验结果表明,所提RS-EIF算法对局部异常很敏感,相较EIF算法减少了约60%的时间开销;在样本数量较多的ODDS数据集上,该算法识别精度高出孤立森林(iForest)算法、轻型在线异常检测(LODA)算法和基于连接函数的异常检测(COPOD)算法2~12个百分点。RS-EIF算法在样本数量大的数据集中识别效率更高。
Aiming at the problem of excessive time overhead of the Extended Isolation Forest(EIF)algorithm,a new algorithm named Extended Isolation Forest based on Random Subspace(RS-EIF)was proposed.Firstly,multiple random subspaces were determined in the original data space.Then,in each random subspace,the extended isolated tree was constructed by calculating the intercept vector and slope of each node,and multiple extended isolated trees were integrated into a subspace extended isolation forest.Finally,the average traversal depth of data point in the extended isolation forest was calculated to determine whether the data point was abnormal.Experimental results on 9 real datasets in Outliter Detection DataSet(ODDS)and 7 synthetic datasets with multivariate distribution show that,the RS-EIF algorithm is sensitive to local anomalies and reduces the time overhead by about 60%compared with the EIF algorithm;on the ODDS datasets with many samples,its recognition accuracy is 2 percentage points to 12 percentage points higher than those of the isolation Forest(iForest)algorithm,Lightweight On-line Detection of Anomalies(LODA)algorithm and COPula-based Outlier Detection(COPOD)algorithm.The RS-EIF algorithm has the higher recognition efficiency in the dataset with a large number of samples.
作者
谢雨
蒋瑜
龙超奇
XIE Yu;JIANG Yu;LONG Chaoqi(School of Software Engineering,Chengdu University of Information Technology,Chengdu Sichuan 610225,China)
出处
《计算机应用》
CSCD
北大核心
2021年第6期1679-1685,共7页
journal of Computer Applications
关键词
异常检测
随机子空间
扩展隔离林算法
扩展孤立树
平均遍历深度
anomaly detection
random subspace
Extended Isolation Forest(EIF)algorithm
extended isolated tree
average traversal depth