Abstract
Noise is an important factor affecting the reliability of machine learning models, and label noise has a more decisive influence on model training than feature noise. Reducing label noise is therefore a key step in classification tasks. Noise filtering is an effective way to handle label noise: it requires neither an estimate of the noise rate nor any particular loss function. However, most existing label noise filtering algorithms suffer from overcleaning. To address this problem, this paper first proposes a label noise filtering framework based on outlier detection and then, within that framework, presents a label noise filtering algorithm via adaptive nearest neighbor clustering (AdNN). AdNN treats each class of the classification problem separately and transforms label noise detection into an outlier detection problem. It first identifies the outliers of each class, then uses relative density to discard the non-noisy samples among those outliers, which yields a candidate noise set, and finally applies a noise factor to identify and filter the label noise among the outliers in the candidate set. Experiments on synthetic and public benchmark datasets show that the proposed method alleviates overcleaning while achieving good noise filtering and classification performance.
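The three stages named in the abstract (per-class outlier detection, a relative-density check, and a noise-factor decision) can be illustrated with a minimal sketch. This is not the AdNN algorithm: the abstract does not define relative density, the noise factor, or the adaptive choice of k, so the code below assumes a fixed `k` and uses simple stand-ins for each quantity. The function names `knn_mean_dist` and `filter_label_noise` and the parameters `outlier_quantile` and `noise_threshold` are hypothetical.

```python
import numpy as np


def knn_mean_dist(points, queries, k, exclude_self=False):
    """Mean Euclidean distance from each query to its k nearest points."""
    d = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)
    start = 1 if exclude_self else 0  # skip the zero self-distance if asked
    return d[:, start:start + k].mean(axis=1)


def filter_label_noise(X, y, k=5, outlier_quantile=0.8, noise_threshold=0.5):
    """Return a boolean mask marking samples flagged as label noise.

    Illustrative stand-ins only (assumptions, not the paper's definitions):
      * outlier score  = mean distance to the k nearest same-class samples
      * relative density check = same-class distance exceeds other-class distance
      * noise factor   = share of the k nearest samples carrying another label
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    noisy = np.zeros(len(y), dtype=bool)

    for c in np.unique(y):
        own = np.flatnonzero(y == c)
        other = np.flatnonzero(y != c)
        if len(own) <= k or len(other) < k:
            continue  # too few samples to compute k-neighbour statistics

        # Stage 1: outliers within class c.
        own_dist = knn_mean_dist(X[own], X[own], k, exclude_self=True)
        outliers = own_dist > np.quantile(own_dist, outlier_quantile)

        # Stage 2: keep only outliers that sit closer to other classes
        # than to their own class (the candidate noise set).
        other_dist = knn_mean_dist(X[other], X[own], k)
        candidates = outliers & (own_dist > other_dist)

        # Stage 3: decide each candidate with the stand-in noise factor.
        for i in own[candidates]:
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf  # exclude the sample itself
            nearest = np.argsort(d)[:k]
            noise_factor = np.mean(y[nearest] != c)
            noisy[i] = noise_factor > noise_threshold

    return noisy
```

Calling `filter_label_noise(X, y)` returns a mask, and `X[~mask], y[~mask]` is the filtered training set. Because stage 2 discards outliers that still lie closer to their own class, sparse but correctly labelled samples are kept, which is the sense in which such a filter may reduce overcleaning.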
Authors
XU Maolong (许茂龙)
JIANG Gaoxia (姜高霞)
WANG Wenjian (王文剑)
Affiliations: College of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2024, Issue 2, pp. 87-99 (13 pages)
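Funding
National Natural Science Foundation of China (U21A20513, 62076154, 61906113)
Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (2020L0007)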
Keywords
Label noise filtering
Outlier detection
Adaptive k-nearest neighbors
Relative density
Noise factor