Abstract
An imbalanced dataset is one in which one class contains far more samples than the other classes. This imbalance affects classification results and biases base classifiers toward the majority class. The synthetic minority over-sampling technique (SMOTE) is a classic over-sampling method for handling data imbalance: it generates each synthetic sample on the line segment whose endpoints are two minority-class samples. This paper proposes a SMOTE-based over-sampling method for the minority class that changes how new samples are generated: more than two minority-class samples are referenced when synthesizing each sample, which increases the diversity of the synthetic samples. Experimental results show that, across different base classifiers, the method achieves a better area under the receiver operating characteristic curve (ROC-AUC) and better stability.
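The contrast drawn in the abstract can be illustrated with a minimal sketch. The first function is the standard SMOTE interpolation step between two minority samples. The second is only a hypothetical illustration of "referencing more than two samples": it draws a random convex combination of several minority samples. The abstract does not give the authors' actual formula, so `convex_mix`, its weighting scheme, and the toy data are assumptions, not the paper's method.

```python
import random


def smote_pair(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def convex_mix(points, rng):
    """Hypothetical multi-sample variant (an assumption, not the paper's formula):
    a random convex combination of all given minority samples."""
    # Normalized exponential draws give weights that are non-negative and sum to 1.
    raw = [rng.expovariate(1.0) for _ in points]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]


rng = random.Random(0)
# Toy minority-class samples in two dimensions (illustrative data only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

two_point = smote_pair(minority[0], minority[1], rng)  # lies on one edge
three_point = convex_mix(minority, rng)                # lies inside the triangle
print(two_point, three_point)
```

With two endpoints, every synthetic sample falls on a line segment. With three or more reference samples, synthetic samples can fall anywhere in the region those samples span, which is one plausible reading of the "increased diversity" the abstract describes.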
Authors
Zhang Tianyi (张天翼)
Ding Lixin (丁立新)
School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (北大核心)
2021, No. 9, pp. 273-279 (7 pages)
Funding
Industry-University-Research Cooperation Project of Zhuhai City, Guangdong Province (2010A090200067, 2016B090918097, 2012D0501990016, 2012D0501990026).
Keywords
Imbalanced dataset
Over-sampling
Sample synthesis
Classification