Speeding up outlier detection in large-scale datasets
Abstract: Existing distance-based outlier detection algorithms are inefficient on large-scale datasets. To address this, a distributed outlier detection algorithm based on clustering and indexing (DODCI) is proposed. The algorithm first partitions the original dataset into clusters using a clustering method. The index of each cluster is then built in parallel on the nodes of a distributed environment. Finally, outliers are detected on each node in an iterative loop using two optimization strategies and two pruning rules. Experiments on a synthetic dataset and preprocessed KDD CUP datasets show that, when the dataset is large, the proposed algorithm is nearly an order of magnitude faster than two existing algorithms, Orca and iDOoR. Theoretical and experimental analyses show that the algorithm can effectively improve the efficiency of outlier detection in large-scale datasets.
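The abstract does not spell out DODCI's two optimization strategies or its two pruning rules, so the following is only a minimal single-machine sketch of the general idea it describes: partition the data into clusters, then search for the top-n distance-based outliers while pruning candidates early. The pruning test used here is the simple cutoff rule familiar from Orca-style algorithms, swapped in as a stand-in; the per-cluster index and the distribution across nodes are omitted. All names (`assign_to_clusters`, `top_n_outliers`, `centers`) are hypothetical, and the nearest-center assignment is an assumed placeholder for whatever clustering method the paper actually uses.

```python
import heapq
import math
import random


def assign_to_clusters(points, centers):
    """Group point indices by nearest center (stand-in for a real clustering step)."""
    clusters = [[] for _ in centers]
    for idx, p in enumerate(points):
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[nearest].append(idx)
    return clusters


def top_n_outliers(points, k, n, centers):
    """Return the n points with the largest k-th-nearest-neighbor distance.

    Assumes len(points) > k. Result is a list of (score, index), largest first.
    """
    clusters = assign_to_clusters(points, centers)
    top = []      # min-heap of (score, index); top[0] is the weakest outlier kept
    cutoff = 0.0  # score a candidate must exceed to enter the top n
    for members in clusters:
        # Scan the candidate's own cluster first: nearby points tighten the
        # bound quickly, so the pruning test below fires early.
        others = [j for c in clusters if c is not members for j in c]
        order = members + others
        for i in members:
            knn = []  # negated distances: a max-heap of the k smallest seen
            pruned = False
            for j in order:
                if j == i:
                    continue
                d = math.dist(points[i], points[j])
                if len(knn) < k:
                    heapq.heappush(knn, -d)
                elif d < -knn[0]:
                    heapq.heapreplace(knn, -d)
                # -knn[0] is an upper bound on the final score; once it drops
                # below the cutoff, point i cannot be a top-n outlier.
                if len(knn) == k and -knn[0] < cutoff:
                    pruned = True
                    break
            if pruned:
                continue
            score = -knn[0]
            if len(top) < n:
                heapq.heappush(top, (score, i))
            elif score > cutoff:
                heapq.heapreplace(top, (score, i))
            if len(top) == n:
                cutoff = top[0][0]
    return sorted(top, reverse=True)


# Usage on assumed toy data: two Gaussian blobs plus two planted far-away points.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
data += [(50.0, 50.0), (-40.0, 30.0)]  # planted outliers at indices 100 and 101
result = top_n_outliers(data, k=3, n=2, centers=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

Scanning a candidate's own cluster first is what makes the partitioning useful in this sketch: most inliers find k close neighbors within their cluster and are discarded after a handful of distance computations, while only genuine outlier candidates pay for a scan of the remaining clusters.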
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2013, Issue 11, pp. 3057-3061 (5 pages)
Keywords: outlier; clustering; index; distributed; optimization strategy; pruning rule
References (14)

  • [1] KNORR E, NG R T. Algorithms for mining distance-based outliers in large datasets [C]// Proceedings of the 24th Very Large Data Base Conference. New York: VLDB Press, 1998: 392-403.
  • [2] RAMASWAMY S, RASTOGI R, SHIM K. Efficient algorithms for mining outliers from large data sets [C]// Proceedings of the ACM SIGMOD Conference on Management of Data. New York: ACM Press, 2000: 427-438.
  • [3] BAY D S, SCHWABACHER M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule [C]// Proceedings of the Ninth ACM SIGKDD on Knowledge Discovery and Data Mining. New York: ACM Press, 2003: 29-38.
  • [4] BHADURI K, MATTHEWS B, GIANNELLA C R. Algorithms for speeding up distance-based outlier detection [C]// Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2011: 859-867.
  • [5] LOZANO E, ACUÑA E. Parallel algorithms for distance-based and density-based outliers [C]// Proceedings of the 2005 IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2005: 729-732.
  • [6] VU N H, GOPALKRISHNAN V. Efficient pruning schemes for distance-based outlier detection [C]// Proceedings of the 2009 European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer-Verlag, 2009: 160-175.
  • [7] OTEY M E, GHOTING A, PARTHASARATHY S. Fast distributed outlier detection in mixed-attribute data sets [J]. Data Mining and Knowledge Discovery, 2006, 12(2/3): 203-228.
  • [8] ANGIULLI F, BASTA S, LODI S, et al. Distributed strategies for mining outliers in large data sets [J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(7): 1520-1532.
  • [9] ANGIULLI F, BASTA S, LODI S, et al. A distributed approach to detect outliers in very large data sets [C]// Proceedings of the 16th International Euro-Par Conference on Parallel Processing. Berlin: Springer-Verlag, 2010: 329-340.
  • [10] KNORR E M, NG R T. Finding intensional knowledge of distance-based outliers [C]// Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann, 1999: 211-222.
