Speeding up outlier detection in large-scale datasets


Abstract: Existing distance-based outlier detection algorithms are inefficient on large-scale datasets. To address this, a distributed outlier detection algorithm based on clustering and indexing (DODCI) is proposed. The algorithm first partitions the large dataset into clusters with a clustering method, then builds an index for each cluster in parallel on the nodes of a distributed environment, and finally detects outliers on each node iteratively, using two optimization strategies and two pruning rules. Experiments on a synthetic dataset and on preprocessed KDD CUP datasets show that, when the data volume is large, the algorithm is nearly an order of magnitude faster than the Orca and iDOoR algorithms. Theoretical and experimental analyses indicate that the algorithm can effectively improve the efficiency of outlier detection in large-scale data.
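The abstract does not spell out the two optimization strategies or the two pruning rules, so the following is only an illustrative single-machine sketch of the general idea, not the DODCI algorithm itself. It assumes that an outlier score is the distance to the k-th nearest neighbour, that the goal is the top n scores, and that pruning follows the cutoff rule familiar from Orca: a point is abandoned once the upper bound on its score drops below the weakest score currently in the top n. The partitioning step (nearest of a few sampled centres) stands in for "a clustering method", and scanning a point's own cluster first stands in for the per-cluster index. All function and parameter names are hypothetical.

```python
import heapq
import math
import random


def partition(data, num_clusters, seed=0):
    """Assign each point index to the nearest of `num_clusters` sampled centres."""
    rng = random.Random(seed)
    centres = rng.sample(data, num_clusters)
    clusters = [[] for _ in centres]
    for i, p in enumerate(data):
        nearest = min(range(num_clusters), key=lambda c: math.dist(p, centres[c]))
        clusters[nearest].append(i)
    return clusters


def top_n_outliers(data, k, n, num_clusters=4, seed=0):
    """Return the n indices with the largest k-th-nearest-neighbour distance.

    Assumes len(data) > k so that every point has at least k neighbours.
    """
    clusters = partition(data, num_clusters, seed)
    top = []      # min-heap of (score, index); top[0] is the weakest outlier kept
    cutoff = 0.0  # score a point must reach to stay a candidate
    for c, members in enumerate(clusters):
        # Scan the point's own cluster first: close neighbours tighten the bound early.
        others = [j for d, m in enumerate(clusters) if d != c for j in m]
        order = members + others
        for i in members:
            knn = []  # negated distances: max-heap of the k smallest seen so far
            pruned = False
            for j in order:
                if j == i:
                    continue
                dist = math.dist(data[i], data[j])
                if len(knn) < k:
                    heapq.heappush(knn, -dist)
                elif dist < -knn[0]:
                    heapq.heapreplace(knn, -dist)
                # -knn[0] is an upper bound on the final score of point i.
                if len(knn) == k and -knn[0] < cutoff:
                    pruned = True
                    break
            if pruned:
                continue
            score = -knn[0]
            if len(top) < n:
                heapq.heappush(top, (score, i))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i))
            if len(top) == n:
                cutoff = top[0][0]
    return [i for _, i in sorted(top, reverse=True)]
```

In a distributed setting, the outer loop over clusters is the part that would be spread across nodes, with the cutoff shared between them; the sketch keeps everything in one process to stay short.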

Authors: XUE Anrong (薛安荣), WEN Dandan (闻丹丹), LIU Bin (刘彬)

Affiliation: School of Computer Science and Communication Engineering, Jiangsu University

Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2013, No. 11, pp. 3057-3061 (5 pages)

Keywords: outlier; clustering; index; distributed; optimization strategy; pruning rule

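Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP391 [Automation and Computer Technology: Computer Application Technology]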


