Efficient nested-loop based outlier detection algorithm for large data sets

Cited by: 1

Abstract: Most existing outlier detection algorithms scale poorly and cannot be applied effectively to large data sets. Building on existing distance-based outlier detection algorithms, this paper designs a norm information table (called a "mode table" in the journal's English abstract) as a storage structure and, using a vector inner product inequality together with suitable storage allocation and scheduling strategies, proposes an efficient outlier detection algorithm, DBoda. The algorithm adopts two strategies. First, during preprocessing it stores the norm information of each point, which reduces the number of point-to-point distance computations. Second, it optimizes the nested-loop method to further reduce I/O overhead. Theoretical analysis and experimental results show that the proposed algorithm has a low time cost and is suitable for large data sets, and that it effectively addresses the time complexity and scalability problems of outlier detection.
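The norm-based pruning idea in the abstract can be illustrated with a small in-memory sketch. It is not the DBoda algorithm itself, whose details the abstract does not give. It assumes Euclidean distance and a Knorr-Ng style definition in which a point is an outlier if at most `max_neighbors` other points lie within `radius` of it. It also assumes the inequality in use is the Cauchy-Schwarz consequence | ‖p‖ − ‖q‖ | ≤ ‖p − q‖. The names `detect_outliers`, `radius`, and `max_neighbors` are hypothetical, and the I/O scheduling part of the method is not modeled.

```python
import math


def norm(p):
    """Euclidean norm of a point given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in p))


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def detect_outliers(points, radius, max_neighbors):
    """Nested-loop distance-based outlier detection with norm pruning.

    Returns (outlier_indices, full_distance_computations).
    Assumption: a point is an outlier if at most `max_neighbors`
    other points lie within `radius` of it.
    """
    # Preprocessing: compute every norm once (the "norm table").
    norms = [norm(p) for p in points]
    outliers = []
    computed = 0
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue
            # | ||p|| - ||q|| | <= ||p - q||, so a norm gap larger than
            # `radius` proves q is not a neighbor; skip the distance.
            if abs(norms[i] - norms[j]) > radius:
                continue
            computed += 1
            if euclidean(p, q) <= radius:
                neighbors += 1
                if neighbors > max_neighbors:
                    break  # p is certainly not an outlier; stop early
        if neighbors <= max_neighbors:
            outliers.append(i)
    return outliers, computed
```

Calling `detect_outliers(points, radius=1.0, max_neighbors=0)` on a tight cluster plus one distant point flags only the distant point, and the norm check lets the inner loop skip every distance computation for it. The pruning is a necessary-condition filter only: points with similar norms can still be far apart, so the full distance is computed whenever the norm gap does not exceed `radius`.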

Authors: Ni Weiwei, Chen Geng, Lu Jieping, Sun Zhihui

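Affiliations: School of Computer Science and Engineering, Southeast University; Key Laboratory of Audit Information Engineering, Nanjing Audit University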

Source: Journal of Southeast University (Natural Science Edition), 2006, No. 3, pp. 463-466 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

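Funding: National Natural Science Foundation of China (70371015); Specialized Research Fund for the Doctoral Program of Higher Education (20040286009); special grant from the Audit Research Institute of the National Audit Office (SK2006007)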

Keywords: large data set; norm information table (mode table); vector inner product inequality; outlier detection

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

