
Local Entropy Based Weighted Subspace Outlier Mining Algorithm (cited by: 28)
Abstract: Outlier detection, an important research direction in data mining, finds the small number of objects in a large data set that differ markedly from the majority. As dimensionality grows, the spatial distribution of the data takes on unusual characteristics and distances computed over the full attribute space lose their meaning, a phenomenon known as the "curse of dimensionality", which renders many existing outlier detection algorithms ineffective on high-dimensional data. To address this problem, a local-entropy-based weighted subspace outlier detection algorithm, SPOD, is proposed. SPOD analyzes the information entropy of each attribute over the neighborhood of a data object and from this generates the object's outlier subspace and attribute weight vector; attributes in the outlier subspace are assigned higher weights. Concepts such as the subspace weighted distance are then defined. Following the idea of density-based outlier detection, the algorithm computes a subspace outlier influence factor for each data object; the larger the factor, the more likely the object is an outlier. Theoretical analysis and experimental results show that SPOD is well suited to high-dimensional data sets and is both efficient and effective.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2008, No. 7, pp. 1189-1194 (6 pages)
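Funding: Natural Science Foundation of Jiangsu Province (BK2006095); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)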
Keywords: high dimensional data; outlier detection; information entropy; subspace mining; weighted vector
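The abstract describes per-attribute entropy over a neighborhood, an attribute weight vector, and a subspace weighted distance, but this record gives no formulas. The following is only a minimal sketch of those three ideas under stated assumptions, not SPOD itself: the histogram entropy estimate, the rule "entropy above the neighborhood mean marks an outlier attribute", the weighted Euclidean form, and every name and parameter (`local_entropy`, `attribute_weights`, `weighted_distance`, `k`, `bins`, `boost`) are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def local_entropy(values, bins=5):
    """Shannon entropy (in bits) of a 1-D sample, estimated from an
    equal-width histogram. The estimator is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no uncertainty
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum falls into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def weighted_distance(x, y, w):
    """Weighted Euclidean distance (assumed form of a 'subspace weighted distance')."""
    return math.sqrt(sum(wj * (a - b) ** 2 for a, b, wj in zip(x, y, w)))


def attribute_weights(data, i, k=5, bins=5, boost=2.0):
    """Hypothetical weight vector for point i: attributes whose entropy over
    the neighborhood of i exceeds the mean entropy get weight `boost`,
    all other attributes get weight 1.0."""
    ones = [1.0] * len(data[i])
    # Neighborhood: point i plus its k nearest points under the unweighted distance.
    order = sorted(range(len(data)),
                   key=lambda j: weighted_distance(data[i], data[j], ones))
    hood = [data[j] for j in order[:k + 1]]
    entropies = [local_entropy([p[a] for p in hood], bins)
                 for a in range(len(ones))]
    mean_h = sum(entropies) / len(entropies)
    return [boost if h > mean_h else 1.0 for h in entropies]
```

A density-based score in the spirit of the abstract would then compare each point's weighted distances to its neighbors against those of the neighbors themselves; that step is left out here because the record does not define the subspace outlier influence factor.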


