Abstract
Conventional oversampling algorithms such as SMOTE (Synthetic Minority Over-sampling Technique) ignore within-class imbalance, extend the classification region of the minority class, and synthesize highly similar samples. To address these problems by jointly considering within-class imbalance and the diversity of synthetic samples, an oversampling algorithm that combines DBSCAN with an improved SMOTE, DB-MCSMOTE (DBSCAN and Midpoint Centroid Synthetic Minority Over-sampling Technique), is proposed. The algorithm first clusters the minority class samples with DBSCAN. It then calculates the cluster density and sampling weight of each cluster according to the proposed cluster density distribution function. Within each cluster, the improved SMOTE algorithm (MCSMOTE) oversamples on the line segments connecting minority class samples that lie far apart, which improves the diversity of the synthetic samples and yields a new dataset that is balanced both between and within classes. Experiments on one two-dimensional synthetic dataset and nine UCI datasets show that DB-MCSMOTE can effectively improve classifier performance on both the minority class samples and the overall dataset.
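The weighting and interpolation steps described in the abstract can be illustrated with a rough sketch. The abstract does not give the cluster density distribution function or the exact MCSMOTE rule, so everything specific here is an assumption made for illustration: density is taken as cluster size divided by mean pairwise distance, sampling weights are taken as normalized inverse density (so sparser clusters receive more synthetic samples), and each synthetic point is drawn on the segment between a random sample and the sample farthest from it in the same cluster. The clusters are assumed to have been produced already (for example by DBSCAN), and all function names are hypothetical.

```python
import math
import random


def cluster_density(points):
    """ASSUMED density: cluster size divided by mean pairwise distance."""
    n = len(points)
    if n < 2:
        return float(n)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    mean_dist = sum(dists) / len(dists)
    return n / mean_dist if mean_dist > 0 else float("inf")


def sampling_weights(clusters):
    """ASSUMED weighting: inverse density, normalized to sum to 1."""
    inv = [1.0 / cluster_density(c) for c in clusters]
    total = sum(inv)
    if total == 0:
        # Degenerate case (all clusters are duplicated points): weight evenly.
        return [1.0 / len(clusters)] * len(clusters)
    return [v / total for v in inv]


def oversample_cluster(points, n_new, rng):
    """Interpolate between a random sample and the sample farthest from it."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(points)
        b = max(points, key=lambda p: math.dist(a, p))
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def weighted_oversample(clusters, n_total, seed=0):
    """Split roughly n_total synthetic samples across clusters by weight."""
    rng = random.Random(seed)
    weights = sampling_weights(clusters)
    out = []
    for pts, w in zip(clusters, weights):
        # Rounding means the final count may differ slightly from n_total.
        out.extend(oversample_cluster(pts, round(w * n_total), rng))
    return out


# Two hypothetical minority-class clusters: one dense, one sparse.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
sparse = [(5.0, 5.0), (7.0, 5.0), (5.0, 7.0)]
new_points = weighted_oversample([dense, sparse], n_total=20)
```

Under these assumptions the sparse cluster receives most of the synthetic samples, which is one way to offset within-class imbalance. Interpolating toward the farthest sample spreads the new points across the cluster instead of concentrating them around nearest neighbors as standard SMOTE does.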
Authors
王亮
冶继民
WANG Liang; YE Jimin (School of Mathematics and Statistics, Xidian University, Xi'an 710126, China)
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2020, No. 18, pp. 111-118 (8 pages)
Funding
National Natural Science Foundation of China (No. 61573014)
Fundamental Research Funds for the Central Universities (No. JB180702)
Keywords
oversampling
within-class imbalance
minority class
diversity
Synthetic Minority Over-sampling Technique (SMOTE) algorithm
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm