
An entropy-based algorithm to detect relative outliers: ENBROD
Cited by: 11
Abstract: This paper presents an entropy-based algorithm for detecting relative outliers in data sets with categorical attributes. Most existing outlier detection methods target data sets with continuous attributes; because categorical values lack the inherent distance metric that continuous values have, those methods cannot simply be applied to categorical data. Moreover, many definitions of outlier take a global view of the data set, yielding "global" outliers, whereas many interesting real-world data sets with more complex structure contain another kind of outlier: objects that are outlying relative to their local neighborhoods, particularly with respect to the densities of those neighborhoods. These are regarded as "relative" outliers. The paper first introduces a new information gain, the leave-one partition information gain (去一划分信息熵增量), and derives its properties through formal analysis. On this basis, it defines a Relative Outlier Factor (ROF) for each object. The factor is relative in the sense that only a restricted neighborhood of each object is taken into account; the ROFs of two classic categorical data sets are shown to demonstrate the validity of the definition. The paper then gives the algorithm ENBROD (ENtropy-Based Relative Outlier Detector) to compute the ROF of every object and discusses its time complexity in detail. In the experimental part, analysis on the zoo data set shows that the outliers detected by ENBROD are meaningful in practice, and results on the Wisconsin breast cancer data set show that, when the neighborhood is large enough, ENBROD's ability to find global outliers is comparable to that of several other existing algorithms, while with smaller neighborhoods ENBROD finds relative outliers that other algorithms are blind to.
Authors: 于绍越, 商琳
Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), 2008, No. 2, pp. 212-218 (7 pages); indexed by CAS, CSCD, and the Peking University core journal list (北大核心).
Fund: National Natural Science Foundation of China (60503022)
Keywords: outlier, categorical attribute, entropy
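
To make the entropy idea in the abstract concrete, here is a minimal Python sketch of a leave-one-out entropy increment and a neighborhood-restricted outlier score for categorical data. It is an illustration only: the functions dataset_entropy, leave_one_out_gain and relative_outlier_scores, the per-attribute entropy sum, and the overlap-based neighborhood are assumptions of this sketch, not the paper's definitions of the leave-one partition information gain or the ROF.

import math
from collections import Counter

def dataset_entropy(rows):
    """Sum of per-attribute Shannon entropies of a categorical data set.
    rows: list of equal-length tuples of categorical values, one per object."""
    n = len(rows)
    if n == 0:
        return 0.0
    total = 0.0
    for j in range(len(rows[0])):
        counts = Counter(r[j] for r in rows)  # value frequencies of attribute j
        total += -sum((c / n) * math.log2(c / n) for c in counts.values())
    return total

def leave_one_out_gain(rows, i):
    """Entropy change caused by removing object i from the data set
    (a hypothetical stand-in for the paper's leave-one partition information gain)."""
    return dataset_entropy(rows) - dataset_entropy(rows[:i] + rows[i + 1:])

def relative_outlier_scores(rows, k):
    """Toy ROF-like score: leave-one-out gain of each object, computed only over a
    k-object neighborhood ranked by attribute-value overlap (the paper's actual
    neighborhood and ROF definitions may differ)."""
    scores = []
    for i, ri in enumerate(rows):
        # rank objects by how many attribute values they share with object i
        order = sorted(range(len(rows)),
                       key=lambda j: -sum(a == b for a, b in zip(ri, rows[j])))
        neigh = [rows[j] for j in order[:k] if j != i] + [ri]  # neighbors plus object i
        scores.append(leave_one_out_gain(neigh, len(neigh) - 1))
    return scores

# Tiny usage example: the last object carries rare attribute values, so removing it
# lowers the entropy of its neighborhood the most and it receives the largest score.
data = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "x"), ("b", "z")]
print(relative_outlier_scores(data, k=4))

The intuition carried over from the abstract is that removing an outlying object, whose attribute values are rare in its neighborhood, reduces the neighborhood's entropy more than removing a typical object, so larger leave-one-out gains flag candidate relative outliers; shrinking k makes the score local, while a large k approaches a global view of the data set.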

