Abstract
To address the inability of global outlier detection methods to detect local outliers, and the performance degradation of the local outlier factor (LOF) when many local outliers are present, this paper proposes an outlier detection and interpretation method based on an improved clustering by fast search and find of density peaks algorithm (KDPC), which uses k-nearest neighbors (KNN) and kernel density estimation (KDE). The method analyzes each data point from both a global and a local perspective. First, the local density of each data point is computed with KNN and KDE, replacing the cutoff-distance-based local density of the traditional DPC algorithm. Second, the sum of a data point's k-nearest-neighbor distances is taken as its global anomaly value, and the KDPC clustering algorithm is used to compute cluster densities and each data point's local anomaly value. Finally, the product of the global and local anomaly values gives the final anomaly score; the Top-n data points with the highest scores are selected as outliers, and a global-local anomaly value decision graph is constructed to interpret global and local outliers. Experiments on artificial and UCI datasets compare the method with 10 commonly used outlier detection methods. The results show that the method achieves high detection accuracy and performance for both global and local outliers, and that its AUC is only slightly affected by the value of k. The method is also applied to NBA player data, further demonstrating its practicality and effectiveness.
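The scoring pipeline in the abstract can be illustrated with a small NumPy sketch. Only two steps are taken directly from the abstract: the global anomaly value is the sum of k-nearest-neighbor distances, and the final score is the product of the global and local values, from which the Top-n points are selected. Everything else is an assumption made for illustration. The Gaussian kernel, the fixed `bandwidth`, and the stand-in local value (mean neighbor density divided by the point's own density) are not the paper's KDPC cluster-density definition, which the abstract does not spell out. The function names are hypothetical.

```python
import numpy as np


def knn_distances(X, k):
    """Return (distances, indices) of each row's k nearest neighbors, excluding itself."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx


def anomaly_scores(X, k=5, bandwidth=1.0):
    """Return (final, global, local) anomaly values for every row of X."""
    d, idx = knn_distances(X, k)

    # From the abstract: global anomaly value = sum of k-nearest-neighbor distances.
    global_value = d.sum(axis=1)

    # ASSUMPTION: KNN-restricted density with a Gaussian kernel and a fixed bandwidth.
    density = np.exp(-(d ** 2) / (2.0 * bandwidth ** 2)).mean(axis=1)

    # ASSUMPTION: stand-in local value, NOT the paper's KDPC cluster-density formula.
    # A point much sparser than its neighbors receives a large value.
    local_value = density[idx].mean(axis=1) / (density + 1e-12)

    # From the abstract: final score = global value * local value.
    return global_value * local_value, global_value, local_value


def top_n_outliers(scores, n):
    """Indices of the n highest-scoring points, highest first."""
    return np.argsort(scores)[::-1][:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # One tight cluster plus a single far-away point appended as the last row.
    X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)), [[10.0, 10.0]]])
    scores, g, l = anomaly_scores(X, k=5)
    print(top_n_outliers(scores, 3))
```

Plotting the global value against the local value for each point gives a rough analogue of the global-local decision graph mentioned in the abstract: points extreme on the global axis are far from all data, while points extreme only on the local axis are sparse relative to their own neighborhood.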
Authors
周玉
夏浩
裴泽宣
ZHOU Yu; XIA Hao; PEI Zexuan (School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China)
Source
《哈尔滨工业大学学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2024, No. 8, pp. 68-85 (18 pages)
Journal of Harbin Institute of Technology
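Funding
National Natural Science Foundation of China (U1504622, 31671580)
Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079).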
Keywords
outlier detection
clustering
density peaks
k-nearest neighbors
kernel density estimation