Abstract
The class imbalance problem is one of the main concerns in machine learning and data mining. Many methods have been proposed to address it, among which instance sampling is the simplest, most effective, and most widely used. SMOTE (synthetic minority oversampling technique), the most popular sampling algorithm, is easily affected by noisy instances and generalizes poorly. To address these shortcomings, this paper presents an improved SMOTE algorithm based on probability density estimation. First, the instances of each class are assumed to follow a Gaussian mixture distribution, and a Gaussian mixture model is used to estimate the probability density of each instance; noisy instances are then filtered out by comparing the rankings of each instance's intra-class and inter-class probability densities. Second, the probability densities are recomputed on the filtered minority-class instances, which are divided into three groups according to their density characteristics: boundary, safe, and outlier instances. Finally, a different SMOTE sampling strategy is applied to each of the three groups. In addition, to further improve generalization, the neighborhood calculation rule of SMOTE is modified. Experiments on several benchmark binary-class imbalanced data sets show that the proposed algorithm is effective and feasible, and that it significantly outperforms several existing sampling algorithms.
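The pipeline described in the abstract (estimate densities, filter noise, then oversample with SMOTE) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: a Gaussian kernel density estimate (an equal-weight Gaussian mixture with one component per instance) stands in for a fitted Gaussian mixture model; the noise rule compares raw intra-class and inter-class densities rather than their rankings; and the boundary/safe/outlier strategies and the modified neighborhood rule are omitted because the abstract does not specify them. The names `density`, `filter_noise`, `smote`, and the `bandwidth` parameter are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def density(x, points, bandwidth=1.0):
    """Gaussian kernel density of x under `points` (constant factor omitted).

    Stand-in for a fitted Gaussian mixture model: this is a mixture with one
    equal-weight component per instance and a shared, fixed bandwidth.
    """
    if not points:
        return 0.0
    total = sum(math.exp(-sq_dist(x, p) / (2 * bandwidth ** 2)) for p in points)
    return total / len(points)


def filter_noise(own, other, bandwidth=1.0):
    """Keep instances that are at least as dense under their own class.

    Assumed rule for illustration only; the paper compares density rankings.
    """
    kept = []
    for i, x in enumerate(own):
        rest = own[:i] + own[i + 1:]  # leave-one-out intra-class set
        if density(x, rest, bandwidth) >= density(x, other, bandwidth):
            kept.append(x)
    return kept


def smote(minority, n_new, k=3, seed=0):
    """Classic SMOTE: interpolate between an instance and one of its k nearest neighbors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority instances")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy data: majority near the origin, minority near (5, 5), plus one minority
# instance planted inside the majority region to act as noise.
majority = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.2, -0.5), (0.4, 0.6), (-0.4, -0.2)]
minority = [(5.0, 5.0), (5.5, 4.8), (4.7, 5.3), (0.1, 0.1)]

clean_minority = filter_noise(minority, majority)
new_points = smote(clean_minority, n_new=3, k=2)
```

In this toy run the planted instance at (0.1, 0.1) is dropped before oversampling, so no synthetic instance is interpolated toward the majority region. That is the failure mode of plain SMOTE that noise filtering is meant to avoid.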
Authors
Li Tao (李涛)
Zheng Shang (郑尚)
Zou Haitao (邹海涛)
Yu Hualong (于化龙)
School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
Source
Journal of Nanjing Normal University (Natural Science Edition) (《南京师大学报(自然科学版)》)
Indexed in: CAS, CSCD, PKU Core (北大核心)
2019, Issue 1, pp. 65-72 (8 pages)
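Funding
National Natural Science Foundation of China (61305058, 61572242)
Natural Science Foundation of Jiangsu Province (BK20130471)
China Postdoctoral Science Foundation Special Funding Program (2015T80481)
China Postdoctoral Science Foundation (2013M540404)
Jiangsu Province Postdoctoral Fund (1401037B)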
Keywords
class imbalance
probability density
instance sampling
SMOTE
Gaussian mixture distribution