Under-Sampling Method Based on Sample Weight for Imbalanced Data

Cited by: 43
Abstract: Imbalanced data are widespread in the real world, and classifying them is an active research topic in data mining and machine learning. Under-sampling is a common way to handle imbalanced data sets: it selects a subset of the majority class so that the class distribution becomes balanced, but it tends to discard useful information in the majority class. To address this, the paper proposes KAcBag (K-means AdaCost bagging), an under-sampling method based on sample weights, in which a sample's weight reflects the region it lies in. First, each sample's weight is initialized according to the number of samples in each class and then updated through multiple rounds of clustering; majority-class samples with small weights lie in the central region of the majority class. Next, the majority class is under-sampled according to these weights, so that samples in the central region are more likely to be drawn, and the drawn samples are combined with all minority-class samples to form the training data for each bagging member classifier, yielding several decision tree sub-classifiers. Finally, the prediction model is produced by weighted voting based on the accuracy of each sub-classifier. Experiments on 19 UCI data sets and on handset-replacement data for customers of a telecom operator show that the samples selected by KAcBag are highly representative, and that the method effectively improves minority-class classification performance while reducing the problem size.
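The sampling and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the clustering-based weight update nor the exact sampling distribution, so the sketch takes the majority-class weights as given and assumes a selection probability proportional to 1 / weight. A depth-1 decision tree (stump) stands in for the decision tree sub-classifiers, and each member's vote is weighted by its training accuracy, which is also an assumption. All function names are hypothetical.

```python
import random

def undersample_majority(majority, weights, k, rng):
    """Draw k distinct majority samples without replacement.

    Assumption (not from the paper): selection probability is
    proportional to 1 / weight, so low-weight (central) samples
    are drawn more easily. Weights must be positive.
    """
    pool = list(range(len(majority)))
    inv = [1.0 / weights[i] for i in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        j = rng.choices(range(len(pool)), weights=inv, k=1)[0]
        chosen.append(majority[pool[j]])
        pool.pop(j)   # remove the drawn sample so it cannot repeat
        inv.pop(j)
    return chosen

def fit_stump(X, y):
    """Depth-1 decision tree, a stand-in for a full decision tree.

    Returns (training accuracy, feature index, threshold, polarity).
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pol in (1, -1):
                pred = [1 if pol * (x[f] - t) > 0 else 0 for x in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, pol)
    return best

def stump_predict(stump, x):
    _, f, t, pol = stump
    return 1 if pol * (x[f] - t) > 0 else 0

def fit_ensemble(majority, weights, minority, n_members=5, seed=0):
    """Train one stump per balanced subset (label 0 = majority, 1 = minority)."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        sampled = undersample_majority(majority, weights, len(minority), rng)
        X = sampled + minority            # all minority samples are kept
        y = [0] * len(sampled) + [1] * len(minority)
        members.append(fit_stump(X, y))
    return members

def predict(members, x):
    """Accuracy-weighted vote over the member classifiers."""
    score = {0: 0.0, 1: 0.0}
    for m in members:
        score[stump_predict(m, x)] += m[0]
    return 1 if score[1] > score[0] else 0
```

For example, with `majority = [[0.0], [0.5], [1.0], [1.5], [2.0]]`, equal weights, and `minority = [[8.0], [9.0]]`, each member is trained on two sampled majority points plus both minority points, and `predict(fit_ensemble(majority, [1.0] * 5, minority), [9.5])` votes for the minority class.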
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2016, No. 11, pp. 2613-2622 (10 pages)
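Funding: National Natural Science Foundation of China (61272060); Ministry of Education Humanities and Social Sciences Planning Fund (15XJA630003); Chongqing Municipal Education Commission Science and Technology Research Project (KJ1500416); Natural Science Foundation of Chongqing (CSTC2013jjB40003)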
Keywords: imbalanced data; under-sampling; sample weight; clustering; ensemble learning