Imbalanced data ensemble classification based on cluster-based under-sampling algorithm
Abstract: Most traditional classification algorithms assume that the data set is balanced and pursue overall classification accuracy. Real-world data sets, however, are often imbalanced, so traditional classifiers tend to misclassify minority-class samples at a high rate. Existing improvements for imbalanced data fall into two categories: data-level methods, which use over-sampling to add minority-class samples or under-sampling to remove majority-class samples, and algorithm-level methods, which modify the classifier itself. Building on the cluster-based under-sampling method and on ensemble learning, this paper combines the two ideas to classify imbalanced data. First, in the data-processing stage, cluster-based under-sampling produces a balanced data set; the new data set is then trained with the AdaBoost ensemble algorithm. During the ensemble process, weights distinguish minority-class from majority-class samples when the ensemble error rate is computed, so the algorithm pays more attention to the minority class and the classification accuracy of minority-class data improves.
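The two-stage pipeline described in the abstract, clustering the majority class to form a balanced training set and then boosting with a class-weighted error, can be sketched as follows. This is a minimal illustrative sketch on 1-D toy data: the hand-rolled k-means, the decision-stump AdaBoost, and the `cost` vector that up-weights minority samples in the error computation are assumptions made for illustration, not the paper's exact formulation.

```python
import math
import random

def kmeans_undersample(majority, k, iters=20, seed=0):
    """Cluster the 1-D majority-class samples with a simple k-means and
    keep the member nearest each centroid, shrinking the majority class
    to at most k representatives."""
    rng = random.Random(seed)
    centroids = rng.sample(majority, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in majority:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Empty clusters keep their previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [min(c, key=lambda x: abs(x - centroids[j]))
            for j, c in enumerate(clusters) if c]

def stump(x, thresh, polarity):
    """Decision stump on one feature: predicts +1 or -1."""
    return polarity if x >= thresh else -polarity

def train_cost_adaboost(X, y, cost, rounds=10):
    """AdaBoost with decision stumps; cost[i] > 1 for minority samples
    inflates their share of the weighted error, so later rounds focus
    on the minority class (an illustrative weighting, assumed here)."""
    total = sum(cost)
    w = [c / total for c in cost]          # cost-aware initial weights
    ensemble = []                          # list of (alpha, thresh, polarity)
    thresholds = sorted(set(X))
    for _ in range(rounds):
        best = None                        # (error, thresh, polarity)
        for t in thresholds:
            for pol in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, t, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        if err >= 0.5:                     # no better than chance: stop
            break
        err = max(err, 1e-10)              # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        w = [wi * math.exp(-alpha * yi * stump(xi, t, pol))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Imbalanced toy data: 9 majority (-1) vs. 3 minority (+1) samples.
minority = [9.0, 10.0, 11.0]
majority = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4, 6.0, 6.2, 6.4]
reps = kmeans_undersample(majority, k=len(minority))
X = reps + minority
y = [-1] * len(reps) + [1] * len(minority)
cost = [1.0] * len(reps) + [2.0] * len(minority)   # emphasise minority
model = train_cost_adaboost(X, y, cost)
```

Balancing first keeps the booster from spending its rounds on redundant majority samples, while the cost vector keeps the remaining rounds biased toward the minority class.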
Source: Chinese Journal of Engineering (工程科学学报), 2017, No. 8, pp. 1244-1253 (10 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (71271027); Specialized Research Fund for the Doctoral Program of Higher Education (20120006110037).
Keywords: imbalanced data; under-sampling; clustering; ensemble learning
