Abstract
Existing distance-based outlier detection algorithms are inefficient on large-scale datasets. To address this problem, a distributed outlier detection algorithm based on clustering and indexing (DODCI) is proposed. First, a clustering method partitions the large dataset into clusters. Then, the index of each cluster is built in parallel on the nodes of the distributed environment. Finally, outliers are detected iteratively on each node using two optimization strategies and two pruning rules. Experimental results on synthetic and preprocessed KDD CUP datasets show that, when the data volume is large, the proposed algorithm is nearly an order of magnitude faster than the Orca and iDOoR algorithms. Theoretical and experimental analyses show that the algorithm can effectively improve the efficiency of outlier detection in large-scale data.
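For orientation, the sketch below illustrates the kind of distance-based detection with pruning that the abstract builds on. It is not the DODCI algorithm: the abstract does not specify its clustering method, index structure, optimization strategies, or pruning rules. The code shows only a single-node, Orca-style nested loop that ranks points by the distance to their k-th nearest neighbour and abandons a candidate once it can no longer reach the current top n. The function name knn_distance_outliers and all parameters are hypothetical.

```python
import heapq
import math
import random


def knn_distance_outliers(points, k, n_outliers, seed=0):
    """Return (score, index) pairs for the n_outliers points whose distance
    to their k-th nearest neighbour is largest (hypothetical helper)."""
    order = list(range(len(points)))
    # Scanning neighbours in random order lets the cutoff test fire early on average.
    random.Random(seed).shuffle(order)
    top = []      # min-heap of (score, index): the best outliers found so far
    cutoff = 0.0  # weakest score in a full `top`; candidates at or below it are pruned
    for i in range(len(points)):
        nearest = []  # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning: the running k-th NN distance can only shrink, so once it
            # is not above the cutoff, point i cannot enter a full top list.
            if len(top) == n_outliers and len(nearest) == k and -nearest[0] <= cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

In a distributed variant of the kind the abstract describes, each node would presumably run a search like this against per-cluster indexes rather than a full scan; that part is left out here because the abstract gives no details.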
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2013, No. 11, pp. 3057-3061 (5 pages)
Keywords
outlier
clustering
index
distributed
optimization strategy
pruning rule