Abstract
Imbalanced data are common in real-world applications, and identifying the minority class of interest remains a challenging problem. Building on the notion of sample learning complexity introduced by the ADASYN algorithm, this paper designs a new oversampling method, LDSMOTE. In this method, the learning complexity of a minority-class main sample depends on how that sample is distributed in both the minority-class and majority-class sample spaces. ADASYN uses only the distribution of majority-class samples in the neighborhood, whereas LDSMOTE combines the local average distance to minority-class samples with the number of local majority-class samples. Unlike ADASYN, where complexity takes discrete values, the complexity in this paper is continuous, which better reflects the differences between main samples and the diversity of their complexity. Using a support vector machine as the classifier, experiments were run on 19 data sets from the KEEL imbalanced data repository. The results show that LDSMOTE outperforms SMOTE, Borderline-SMOTE, and ADASYN in Recall, G-mean, and AUC on more than half of the data sets.
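The two local quantities named in the abstract can be illustrated with a minimal sketch. The function name `local_stats`, the parameter `k`, and the use of Euclidean distance are assumptions made for illustration. The first quantity is the ADASYN-style ratio (majority-class count among the k nearest neighbors, divided by k), which can only take the discrete values 0, 1/k, ..., 1. The second is the mean distance to the k nearest minority-class neighbors, which is continuous. How LDSMOTE fuses the two into a single complexity score is not stated in the abstract, so the sketch deliberately leaves them separate.

```python
import numpy as np

def local_stats(X_min, X_maj, k=5):
    """Per minority sample: (majority fraction among its k nearest neighbors,
    mean distance to its k nearest minority neighbors).

    Illustrative only; not the paper's complexity formula.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    X_all = np.vstack([X_min, X_maj])
    # Boolean mask: True for rows of X_all that belong to the majority class.
    is_maj = np.r_[np.zeros(len(X_min), dtype=bool), np.ones(len(X_maj), dtype=bool)]

    ratios, mean_dists = [], []
    for i, x in enumerate(X_min):
        # Distances from x to every sample; exclude x itself.
        d_all = np.linalg.norm(X_all - x, axis=1)
        d_all[i] = np.inf
        nearest = np.argsort(d_all)[:k]
        # ADASYN-style discrete ratio: majority neighbors / k.
        ratios.append(is_maj[nearest].mean())

        # Continuous quantity: mean distance to the k nearest minority samples.
        d_min = np.delete(np.linalg.norm(X_min - x, axis=1), i)
        mean_dists.append(np.sort(d_min)[:k].mean())
    return np.array(ratios), np.array(mean_dists)
```

Two main samples with the same number of majority neighbors receive the same ratio, yet their mean minority distances will generally differ, which is the kind of difference a continuous complexity measure can express.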
Authors
XU Hao (许皓), SUN Tingkai (孙廷凯)
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094
Source
Computer & Digital Engineering (《计算机与数字工程》)
2020, No. 8, pp. 1846-1851, 1857 (7 pages)
Keywords
oversampling
imbalanced data
main sample
learning complexity
sample distribution