Abstract
Imbalanced data is widespread in real-world applications, and its classification is an active research topic in data mining and machine learning. Under-sampling is a common way to handle imbalanced data sets: it selects a subset of the majority class so that the class distribution becomes balanced, but it tends to discard useful information in the majority class. To address this, an under-sampling method based on sample weights, KAcBag (K-means AdaCost bagging), is proposed. The method introduces a sample weight to reflect the region in which each sample lies. First, the weight of each sample is initialized according to the number of samples in each class and is then modified through repeated clustering; majority class samples with small weights lie in the central region of the majority class. Next, the majority class is under-sampled according to the weights, so that samples in the central region are more likely to be drawn; the drawn samples are combined with all minority class samples to form the training data for each bagging component classifier, yielding several decision tree sub-classifiers. Finally, the prediction model is built by weighted voting according to the accuracy of each sub-classifier. Experiments on 19 UCI data sets and on handset-replacement data from customers of a telecom operator show that the samples selected by KAcBag are more representative, and that the method effectively improves classification performance on the minority class while reducing the problem size.
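The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weight-update formula or sampling distribution, so the rules below are assumptions made for illustration only: weights start at 1 / (class size), a clustering round halves the weight of samples that agree with the dominant class of their cluster and doubles the rest, and a majority sample is drawn with probability proportional to the inverse of its weight. All function names and the `factor` parameter are hypothetical, and the cluster assignments are taken as given rather than computed.

```python
import random
from collections import Counter, defaultdict

def init_weights(labels):
    # Assumed initialization: each sample starts at 1 / (size of its class).
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def update_weights(weights, labels, cluster_ids, factor=2.0):
    # One clustering round under a hypothetical rule: samples matching the
    # dominant class of their cluster are down-weighted, others up-weighted.
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        members[c].append(i)
    new = list(weights)
    for idx in members.values():
        dominant = Counter(labels[i] for i in idx).most_common(1)[0][0]
        for i in idx:
            if labels[i] == dominant:
                new[i] = weights[i] / factor
            else:
                new[i] = weights[i] * factor
    return new

def undersample_majority(labels, weights, majority, n, rng):
    # Draw up to n distinct majority samples without replacement; a lower
    # weight (closer to the class center) means a higher chance of selection.
    pool = [i for i, y in enumerate(labels) if y == majority]
    chosen = []
    for _ in range(min(n, len(pool))):
        inv = [1.0 / weights[i] for i in pool]
        pick = rng.choices(pool, weights=inv, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

def weighted_vote(predictions, accuracies):
    # Combine sub-classifier predictions, each weighted by its accuracy.
    score = defaultdict(float)
    for pred, acc in zip(predictions, accuracies):
        score[pred] += acc
    return max(score, key=score.get)

# Toy data: six majority samples, two minority samples, two clusters.
labels = ["maj"] * 6 + ["min"] * 2
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1]
weights = update_weights(init_weights(labels), labels, cluster_ids)

# Build the training indices for one bagging member.
sample = undersample_majority(labels, weights, "maj", n=2, rng=random.Random(0))
training_idx = sample + [i for i, y in enumerate(labels) if y == "min"]
```

In this toy run, the majority sample at index 5 shares a cluster dominated by the minority class, so its weight rises and it becomes less likely to be drawn than the five samples in the majority-only cluster. In practice `cluster_ids` would come from a k-means run repeated several times, and each `training_idx` would be used to fit one decision tree whose accuracy then feeds `weighted_vote`.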
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
PKU Core Journal
2016, Issue 11, pp. 2613-2622 (10 pages)
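Funding
National Natural Science Foundation of China (61272060)
Humanities and Social Sciences Planning Fund of the Ministry of Education (15XJA630003)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ1500416)
Natural Science Foundation of Chongqing (CSTC2013jjB40003)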
Keywords
imbalanced data
under-sampling
sample weight
clustering
ensemble learning