An entropy-based algorithm to detect relative outliers: ENBROD


Abstract: This paper presents an algorithm for detecting relative outliers in data sets with categorical (discrete) attributes. Many definitions of outlier take a global view of the data set, and the outliers they identify can be regarded as "global" outliers. Many interesting real-world data sets, however, exhibit a more complex structure and contain another kind of outlier: objects that are outlying relative to their local neighborhoods, particularly with respect to the densities of those neighborhoods. These are regarded as "relative" outliers.

Most existing outlier detection methods target data sets with continuous attributes. Because categorical attribute values have no inherent distance metric of the kind that continuous values have, detection algorithms designed for continuous attributes cannot simply be applied to categorical data. The paper first introduces a new kind of information gain (entropy increment), the leave-one partition information gain, and derives its properties through formal analysis. On this basis it defines a Relative Outlier Factor (ROF) for each object. The ROF is relative in the sense that it depends only on a restricted neighborhood of the object. The ROFs of two classic discrete data sets are shown to demonstrate the validity of the definition. The paper then gives the algorithm ENBROD (ENtropy-Based Relative Outlier Detector), which computes the ROF of each object, and discusses its time complexity in detail.

In the experiments, analysis of the zoo data set shows that the outliers detected by ENBROD are meaningful in practice. Results on the Wisconsin breast cancer data set show that, when the neighborhood is large enough, ENBROD's ability to find global outliers is similar to that of several existing algorithms. When the neighborhood is small, ENBROD finds relative outliers that the other algorithms miss.
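The general idea of scoring an object by how much it changes the entropy of its own neighborhood can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the formal definitions of the leave-one partition information gain or of ROF, so the function names, the Hamming-distance neighborhood, and the sum-of-attribute-entropies measure below are hypothetical stand-ins, not the paper's definitions or the ENBROD algorithm itself.

```python
from collections import Counter
from math import log2


def attribute_entropy(rows):
    """Sum of per-attribute Shannon entropies (bits) over a list of tuples.

    Assumption: attributes are treated independently; the paper may
    define entropy over the data set differently.
    """
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate attribute by attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def hamming(a, b):
    """Number of attributes on which two categorical objects disagree."""
    return sum(x != y for x, y in zip(a, b))


def leave_one_entropy_change(data, index, k):
    """Illustrative relative score (NOT the paper's ROF).

    Entropy of the object together with its k nearest neighbours,
    minus the entropy of those neighbours alone.
    """
    target = data[index]
    others = [row for i, row in enumerate(data) if i != index]
    neighbours = sorted(others, key=lambda row: hamming(target, row))[:k]
    return attribute_entropy(neighbours + [target]) - attribute_entropy(neighbours)


# Toy categorical data: four identical objects and one that differs.
data = [
    ("a", "x"), ("a", "x"), ("a", "x"), ("a", "x"),
    ("b", "y"),
]
scores = [leave_one_entropy_change(data, i, k=3) for i in range(len(data))]
```

In this toy data set the four identical objects leave the entropy of their neighborhoods unchanged, while the differing object raises it, so it receives the largest score. The parameter `k` plays the role of the neighborhood size discussed in the abstract: a small `k` gives a local, relative view, and a large `k` approaches a global one.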

Authors: Yu Shaoyue, Shang Lin

Affiliation: State Key Laboratory for Novel Software Technology, Nanjing University

Source: Journal of Nanjing University (Natural Science), 2008, No. 2, pp. 212-218 (7 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (60503022)

Keywords: outlier, categorical attribute, entropy

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

