An Extended Isolation Forest Outlier Detection Algorithm Based on Relevant Subspace (基于相关子空间的扩展隔离森林离群检测算法)
Abstract: Extended isolation forest is an ensemble outlier detection method that selects hyperplanes with random slopes. It separates outliers from normal data objects quickly and has low time complexity. However, when an isolation tree's hyperplane is selected in a dense region of the data set or in a region containing irrelevant dimensions, outlier detection performance degrades severely. Drawing on the idea and method of relevant subspaces, this paper proposes an extended isolation forest outlier detection algorithm. The algorithm uses a Gaussian mixture model to determine the relevant subspace of the data objects, which ensures that the cutting hyperplanes of the isolation trees are selected in sparse data regions. When an isolation tree branches, the random intercept point of the hyperplane is chosen preferentially in a sparse data region, so outliers are quickly isolated from that region and irrelevant attribute dimensions do not interfere with the choice of the hyperplane's random slope. The average path length of each data object across the isolation trees is normalized and used as its outlier score, and the data objects with the largest outlier scores are reported as outliers. Experiments on UCI data sets validate the effectiveness of the algorithm and examine how the sub-sample size, the number of isolation trees, and the number of nearest neighbors affect detection performance.
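The scoring pipeline described in the abstract (random-slope hyperplanes, average path length, normalized outlier score) can be sketched roughly as follows. This is a minimal illustration of a plain extended isolation forest, not the authors' algorithm: the abstract gives no formulas, so the normalization `2 ** (-mean_depth / c(n))` is assumed from the standard isolation forest formulation, and the Gaussian-mixture relevant-subspace step and the sparse-region preference for intercept points are omitted. All function and parameter names (`c_factor`, `build_tree`, `path_length`, `outlier_scores`, `n_trees`, `sample_size`) are hypothetical.

```python
import numpy as np


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + 0.5772156649  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(X, rng, depth, max_depth):
    """Recursively split X with random-slope hyperplanes until isolation or max depth."""
    n = len(X)
    if depth >= max_depth or n <= 1:
        return {"size": n}
    normal = rng.normal(size=X.shape[1])                 # random slope (hyperplane normal)
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))  # random point in the bounding box
    left = (X - intercept) @ normal <= 0
    return {
        "normal": normal,
        "intercept": intercept,
        "left": build_tree(X[left], rng, depth + 1, max_depth),
        "right": build_tree(X[~left], rng, depth + 1, max_depth),
    }


def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, plus a correction for unsplit points in that leaf."""
    if "size" in node:
        return depth + c_factor(node["size"])
    side = "left" if (x - node["intercept"]) @ node["normal"] <= 0 else "right"
    return path_length(x, node[side], depth + 1)


def outlier_scores(X, n_trees=100, sample_size=64, seed=0):
    """Return one score per row of X; higher means more outlying (assumed normalization)."""
    X = np.asarray(X, dtype=float)
    if len(X) < 2:
        raise ValueError("need at least two data objects")
    rng = np.random.default_rng(seed)
    sample_size = min(sample_size, len(X))
    max_depth = int(np.ceil(np.log2(sample_size)))
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        trees.append(build_tree(X[idx], rng, 0, max_depth))
    mean_depth = np.array([np.mean([path_length(x, t) for t in trees]) for x in X])
    return 2.0 ** (-mean_depth / c_factor(sample_size))
```

Under these assumptions, the data objects reported as outliers would be those with the largest scores, for example `np.argsort(outlier_scores(X))[-k:]` for some chosen `k`.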
Authors: LIU Jia (刘佳); ZHU Peng-yun (朱鹏云); XUN Ya-ling (荀亚玲) (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)
Source: Computer Technology and Development (《计算机技术与发展》), 2022, No. 10, pp. 26-33, 40 (9 pages in total)
Funding: National Natural Science Foundation of China (61602335); Natural Science Foundation of Shanxi Province (201901D211302).
Keywords: outlier detection; extended isolation forest; relevant subspace; Gaussian mixture model; sparse data area