Abstract
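This paper proposes an algorithm for detecting relative outliers in data sets with categorical attributes. Most existing outlier detection methods focus on data sets with continuous attributes. Because categorical attribute values have no inherent distance metric of the kind that continuous values have, detection algorithms designed for continuous attributes cannot simply be applied to categorical data. The paper first introduces a new notion of information gain, the leave-one partition information gain, and derives its properties through formal analysis. Based on this notion, a Relative Outlier Factor (ROF) is defined for each object. The ROF of an object is relative because it depends only on that object's neighborhood. The ENBROD algorithm is then proposed to compute the ROFs. Finally, experiments show that when the neighborhood is small, ENBROD finds relative outliers that existing methods miss, and when the neighborhood is large enough, its ability to find global outliers is close to that of several other outlier detection algorithms.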
In outlier detection, many definitions of an outlier take a global view of the data set, and such outliers can be viewed as “global” outliers. However, many interesting real-world data sets exhibit a more complex structure and contain another kind of outlier: objects that are outlying relative to their local neighborhoods, particularly with respect to the densities of those neighborhoods. These are regarded as “relative” outliers. This paper presents an entropy-based algorithm for detecting relative outliers in data sets with categorical attributes. After introducing a new information gain, named the leave-one partition information gain, the paper defines an outlier factor called the Relative Outlier Factor (ROF) for each object. The outlier factor is relative in the sense that only a restricted neighborhood of each object is taken into account. The ROFs of two classic discrete data sets are then shown to demonstrate the validity of ROF. Furthermore, the paper provides the algorithm ENBROD (ENtropy-Based Relative Outlier Detector) to compute the ROF of each object, and the time complexity of ENBROD is discussed in detail. In the experimental part, the analysis of experiments on the zoo data set demonstrates that the outliers detected by ENBROD are meaningful in practice. The results on the Wisconsin breast cancer data set demonstrate that the ability of ENBROD to find global outliers is similar to that of several other existing algorithms when the neighborhood is large enough. Moreover, when the neighborhood is smaller, ENBROD is able to find outliers that the other algorithms are blind to.
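The abstract does not state the formal definitions of the leave-one partition information gain or of ROF, so the following is only an illustrative sketch of the general idea (scoring an object by how much entropy it adds to its own neighborhood), not the paper's ENBROD algorithm. Everything in it is an assumption made for illustration: the function names `entropy`, `hamming` and `leave_one_entropy_change`, the use of Hamming (attribute mismatch) distance to pick the `k` nearest neighbors, and the use of the sum of per-attribute Shannon entropies as the entropy of a set of objects.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Sum of per-attribute Shannon entropies (in bits) of categorical rows.

    Assumption: attributes are treated independently; the paper may define
    the entropy of a set of objects differently.
    """
    n = len(rows)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*rows):  # iterate over attributes
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def hamming(a, b):
    """Number of attributes on which two objects disagree (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def leave_one_entropy_change(data, i, k):
    """Entropy of object i's k-neighborhood with i, minus the entropy without i.

    Hypothetical score: a large positive value means object i makes its own
    neighborhood noticeably more disordered, i.e. it looks locally outlying.
    """
    others = [j for j in range(len(data)) if j != i]
    # Stable sort: ties in distance are broken by original index.
    others.sort(key=lambda j: hamming(data[i], data[j]))
    neighbors = [data[j] for j in others[:k]]
    return entropy(neighbors + [data[i]]) - entropy(neighbors)


# Toy data: four similar objects and one that shares no attribute value.
data = [
    ("a", "x"), ("a", "x"), ("a", "x"), ("a", "y"),
    ("b", "z"),
]
scores = [leave_one_entropy_change(data, i, k=3) for i in range(len(data))]
most_outlying = max(range(len(data)), key=scores.__getitem__)
```

In this toy data set the object `("b", "z")` receives the highest score, the partially deviating `("a", "y")` a smaller positive score, and the typical `("a", "x")` objects a slightly negative one. Varying `k` changes how local the judgment is, which mirrors the role the abstract attributes to the neighborhood size.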
Source
《南京大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal
2008, No. 2, pp. 212-218 (7 pages)
Journal of Nanjing University (Natural Science)
Funding
National Natural Science Foundation of China (60503022)
Keywords
outlier
categorical attribute
information entropy