Abstract
To address the unsatisfactory performance of traditional similarity-based outlier detection algorithms on high-dimensional imbalanced datasets, this paper proposes a novel ensemble learning and random projection-based outlier detection (EROD) framework. First, the algorithm ensembles several random projection methods to reduce the dimensionality of the high-dimensional data, which increases data diversity. Second, it integrates several different traditional outlier detectors into a heterogeneous ensemble model, which increases the robustness of the algorithm. Finally, the heterogeneous model is trained on the dimension-reduced data, and the trained models are combined in two optimization steps to reduce the generalization error and output a final outlier score for each object; objects with high outlier scores are identified as outliers. Comparative experiments on four high-dimensional imbalanced real-world datasets from different domains show that, compared with traditional outlier detection algorithms and ensemble learning-based outlier detection algorithms, EROD improves AUC and precision@n by 3.6% and 14.45% on average, respectively. These results indicate that EROD is well suited to detecting anomalies in high-dimensional imbalanced data.
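The three stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the base detectors or the two combination rules, so the choices here are assumptions made only for illustration: two simple detectors (k-th nearest-neighbour distance and distance to the centroid), averaging over detectors as the first combination, and taking the maximum over projections as the second. All function names, including `erod_like_scores`, are hypothetical.

```python
import math
import random
import statistics


def random_projection(X, k, rng):
    """Project each row of X onto k random Gaussian directions."""
    d = len(X[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)] for x in X]


def knn_score(X, n_neighbors=3):
    """Assumed detector A: distance to the n_neighbors-th nearest neighbour."""
    scores = []
    for i, x in enumerate(X):
        dists = sorted(math.dist(x, y) for j, y in enumerate(X) if j != i)
        scores.append(dists[n_neighbors - 1])
    return scores


def centroid_score(X):
    """Assumed detector B: Euclidean distance to the data centroid."""
    centroid = [statistics.fmean(col) for col in zip(*X)]
    return [math.dist(x, centroid) for x in X]


def zscore(scores):
    """Standardize scores so different detectors are comparable."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # guard against zero spread
    return [(s - mu) / sigma for s in scores]


def erod_like_scores(X, n_projections=5, k=3, seed=0):
    """Return one outlier score per row of X (higher means more outlying)."""
    rng = random.Random(seed)
    detectors = [knn_score, centroid_score]  # heterogeneous base detectors
    per_projection = []
    for _ in range(n_projections):
        Z = random_projection(X, k, rng)  # stage 1: random projection
        normalized = [zscore(det(Z)) for det in detectors]  # stage 2: detectors
        # Assumed first combination: average the detectors within a projection.
        per_projection.append([statistics.fmean(col) for col in zip(*normalized)])
    # Assumed second combination: maximum over all projections.
    return [max(col) for col in zip(*per_projection)]
```

Calling `erod_like_scores(X)` on a list of equal-length numeric rows returns one score per row, and the rows with the largest scores are the candidate outliers. Averaging before taking the maximum is one common way to damp the noise of individual detectors while still letting a single informative projection flag a point; the paper's actual combination scheme may differ.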
Authors
Guo Yiyang (郭一阳)
Yu Jiong (于炯)
Du Xusheng (杜旭升)
Cao Ming (曹铭)
College of Information Science & Engineering, Xinjiang University, Urumqi 830091, China; School of Software, Xinjiang University, Urumqi 830091, China; College of Information Science & Engineering, Ocean University of China, Qingdao, Shandong 266100, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2022, No. 9, pp. 2608-2614 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (61862060, 61462079, 61562086, 61562078).
Keywords
data mining
outlier detection
random projection
ensemble learning