Improved outlier detection and interpretation method for DPC clustering algorithm

Abstract: Global outlier detection methods cannot detect local outliers, and the local outlier factor (LOF) loses performance when a large number of local outliers are present. To address both limitations, this paper uses k-nearest neighbors (KNN) and kernel density estimation (KDE) to propose an outlier detection and interpretation method based on an improved density peak clustering algorithm (KDPC), a variant of clustering by fast search and find of density peaks (DPC). The method analyzes each data point from both a global and a local perspective. First, the local density of each data point is computed with KNN and KDE, replacing the truncation-distance-based local density of the traditional DPC algorithm. Second, the sum of a data point's k-nearest-neighbor distances is taken as its global outlier value, and the KDPC clustering algorithm is used to compute the cluster density and the local outlier value of each data point. Finally, the global and local outlier values are multiplied to give the final anomaly score, the n data points with the highest anomaly scores (Top-n) are selected as outliers, and global and local outliers are interpreted by constructing a global-local outlier value decision graph. Experiments were conducted on artificial and UCI datasets, and the method was compared with 10 commonly used outlier detection methods. The results show that the method achieves high detection accuracy and performance for both global and local outliers, and that its AUC is only slightly affected by the value of k. The method was also applied to NBA player data, further demonstrating its practicality and effectiveness.
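The scoring pipeline described in the abstract can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the Gaussian kernel, the `bandwidth` parameter, and all function names are hypothetical. In particular, the paper derives the local outlier value from KDPC cluster densities; the sketch swaps in a simpler neighbor-density ratio as a stand-in for that step.

```python
import numpy as np


def knn_distances(X, k):
    """Return distances and indices of each point's k nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)                # a point is not its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]         # indices of the k closest points
    dists = np.take_along_axis(D, idx, axis=1)
    return dists, idx


def outlier_scores(X, k=5, bandwidth=1.0):
    """Anomaly score = global outlier value * local outlier value."""
    dists, idx = knn_distances(X, k)

    # Local density: Gaussian kernel averaged over the k nearest neighbors.
    # ASSUMPTION: kernel choice and bandwidth are not given in the abstract.
    density = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2)).mean(axis=1)

    # Global outlier value: sum of k-nearest-neighbor distances
    # (this part is stated in the abstract).
    global_value = dists.sum(axis=1)

    # Local outlier value: STAND-IN, not the paper's cluster-density formula.
    # Ratio of the neighbors' mean density to the point's own density.
    local_value = density[idx].mean(axis=1) / (density + 1e-12)

    return global_value * local_value


def top_n_outliers(X, n, k=5, bandwidth=1.0):
    """Return indices of the n highest-scoring points, plus all scores."""
    scores = outlier_scores(X, k=k, bandwidth=bandwidth)
    order = np.argsort(scores)[::-1]           # descending by score
    return order[:n], scores
```

Calling `top_n_outliers(X, n)` on an array of shape `(num_points, num_features)` returns the indices of the selected points together with every point's score. Multiplying the two values means a point ranks high only when it is both far from its neighbors and sparse relative to them, which is the intuition the abstract gives for combining global and local evidence.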

Authors: ZHOU Yu; XIA Hao; PEI Zexuan (School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China)

Affiliation: School of Electrical Engineering, North China University of Water Resources and Electric Power

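Source: Journal of Harbin Institute of Technology (indexed in EI, CAS, CSCD, Peking University Core Journals), 2024, No. 8, pp. 68-85 (18 pages)

Funding: National Natural Science Foundation of China (U1504622, 31671580); Training Program for Young Backbone Teachers in Higher Education Institutions of Henan Province (2018GGJS079).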

Keywords: outlier detection; clustering; density peaks; k-nearest neighbors; kernel density estimation

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]
