An Algorithm Based on Partition for Outlier Detection

Cited by: 16
Abstract: Outliers are data objects that do not comply with the general behavior of the data. Partition-based methods divide the space occupied by the data points into a set of non-overlapping hyper-rectangular cells by splitting every dimension into intervals of equal length, assign each data object to its cell, and then use per-cell statistics to find outliers. Because most real-life datasets are highly skewed, partitioning produces a large number of empty cells, which degrades the performance of such algorithms. To address this, a new index structure called CD-Tree (cell dimension tree) is proposed for indexing the non-empty cells. To optimize the structure of the CD-Tree and to guide the partitioning of the data, the concept of partition-based skew of data (SOD) is proposed to measure the degree of data skew. Based on CD-Tree and SOD, a new outlier detection algorithm is designed. Experimental results on real-life datasets show that, compared with the cell-based algorithm, the new algorithm is markedly more efficient and can effectively handle a larger number of dimensions.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2006, No. 5, pp. 1009-1016 (8 pages)
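Funding: National Natural Science Foundation of China; Ministry of Education Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions; Natural Science Foundation of Liaoning Province; Key Research Program of the Education Department of Liaoning Province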
Keywords: data mining; outlier detection; partition; CD-Tree (cell dimension tree); cell-based algorithm
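
The partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the CD-Tree and the SOD measure are not specified in this record, so a plain hash map of non-empty cells stands in for the index, and "cells with very few points" is used as a simplified stand-in for the per-cell statistics. The function names `partition_counts` and `sparse_cells`, the parameter `max_count`, and the sample data are all hypothetical.

```python
from collections import Counter


def partition_counts(points, lows, highs, bins):
    """Assign each point to an equal-width grid cell and count points per cell.

    Only cells that actually receive a point are stored, so empty cells
    cost no memory (a hash map here, not the paper's CD-Tree).
    """
    counts = Counter()
    for point in points:
        cell = []
        for x, lo, hi in zip(point, lows, highs):
            width = (hi - lo) / bins
            idx = int((x - lo) / width)
            # Clamp so that x == hi falls into the last interval.
            cell.append(min(max(idx, 0), bins - 1))
        counts[tuple(cell)] += 1
    return counts


def sparse_cells(counts, max_count):
    """Return the non-empty cells holding at most max_count points.

    Assumption: low-population cells are treated as outlier candidates.
    """
    return sorted(cell for cell, n in counts.items() if n <= max_count)


# Hypothetical 2-D data: three clustered points and one isolated point.
points = [(0.10, 0.10), (0.15, 0.12), (0.20, 0.18), (0.90, 0.95)]
counts = partition_counts(points, lows=(0.0, 0.0), highs=(1.0, 1.0), bins=4)

print(len(counts))              # 2 non-empty cells out of 4 * 4 = 16
print(sparse_cells(counts, 1))  # [(3, 3)]
```

Even in this tiny example only 2 of the 16 cells are occupied, which is the skew effect the abstract points to: an index that stores only non-empty cells avoids paying for the empty ones, and the gap widens as the number of dimensions grows.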

References (8)

  • 1 Knorr E, Ng R. Algorithms for mining distance-based outliers in large data sets. In: Gupta A, Shmueli O, Widom J, eds. Proc. of the VLDB Conf. New York: Morgan Kaufmann Publishers, 1998. 392-403.
  • 2 Knorr E, Ng R. Finding intensional knowledge of distance-based outliers. In: Atkinson MP, Orlowska ME, Valduriez P, Zdonik SB, Brodie ML, eds. Proc. of the VLDB Conf. Edinburgh: Morgan Kaufmann Publishers, 1999. 211-222.
  • 3 Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. In: Chen WD, Naughton JF, Bernstein PA, eds. Proc. of the ACM SIGMOD Conf. Dallas: ACM Press, 2000. 427-438.
  • 4 Breunig MM, Kriegel HP, Ng R, Sander J. LOF: Identifying density-based local outliers. In: Chen WD, Naughton JF, Bernstein PA, eds. Proc. of the ACM SIGMOD Conf. Dallas: ACM Press, 2000. 94-104.
  • 5 Arning A, Agrawal R, Raghavan P. A linear method for deviation detection in large databases. In: Simoudis E, Han JW, Fayyad UM, eds. Proc. of the KDD Conf. Portland: AAAI Press, 1996. 164-169.
  • 6 Beckmann N, Kriegel HP, Schneider R, Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In: Hector GM, Jagadish HV, eds. Proc. of the ACM SIGMOD Conf. Atlantic City: ACM Press, 1990. 322-331.
  • 7 Katayama N, Satoh S. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In: Peckham J, ed. Proc. of the ACM SIGMOD Conf. Tucson: ACM Press, 1997. 369-380.
  • 8 Berchtold S, Keim DA, Kriegel H. The X-tree: An index structure for high-dimensional data. In: Vijayaraman TM, Buchmann AP, Mohan C, Sarda NL, eds. Proc. of the 22nd VLDB Conf. Bombay: Morgan Kaufmann Publishers, 1996. 28-39.

Co-cited documents: 117

Citing documents: 16

Second-level citing documents: 81
