
A Cascade-based Classification Method for Class-imbalanced Data (cited by: 23)
Abstract: In real-world problems, the numbers of samples in different classes often differ greatly, and traditional machine learning methods have difficulty classifying minority-class samples correctly; when the minority-class samples are sufficiently important, this causes large losses. Learning from data with an imbalanced class distribution has therefore become a challenge facing machine learning. Inspired by the cascade model used in computer vision, this paper proposes a classification method for imbalanced data, BalanceCascade. The method gradually shrinks the majority class so that the data set becomes more balanced, and the sequence of classifiers trained in this process classifies test samples as an ensemble. Experimental results show that the method effectively improves classification performance on imbalanced data, especially when that performance is severely affected by the imbalance.

In machine learning and data mining, many factors may influence the performance of a learning system in real-world applications. Class imbalance is one of them: training examples in one class heavily outnumber those in another, and classifiers generally have difficulty learning the concept of the minority class. In many applications the minority class is the more important one, so misclassifying it incurs great loss. Severe class imbalance arises in the face detection problem, where it greatly decreases detection speed; the cascade structure was proposed to accelerate the learning process. A cascade is a classifier system with a sequence of n node classifiers. At the beginning, all training examples are available to train the first node classifier. Then all positive examples, but only a subset of the negative examples, are passed to the next node, discarding those negatives correctly classified by the first node. This procedure repeats until all node classifiers are trained. A test example is passed to the next node if the current node recognizes it as positive, or is immediately rejected as negative otherwise. The learning goal of a cascade node classifier is quite different from that of usual classifiers: every node aims for a high detection rate and only a moderate false alarm rate, so the cascade as a whole can achieve both a high overall detection rate and a low overall false alarm rate. Each time training examples are passed to the next node, some negatives are discarded; that is, there are fewer negatives in the training set than in the previous node.

Considering the class imbalance problem, this means a more balanced training set than in previous nodes. In the early nodes of a cascade it is quite easy to achieve the learning goal, i.e., to train a classifier with a high detection rate and only a moderate false alarm rate. It becomes harder in deeper nodes, since the negative examples there are false positives from previous nodes and are difficult to separate from the positive examples. There is another difference between the face detection problem and general class imbalance problems: hundreds of thousands of features are available to the classifiers in the former case, but not in the latter. In general class imbalance problems, a classifier in a deeper node may not easily achieve both a high detection rate and a moderate false alarm rate, so cascade-style testing may not be appropriate. Instead of testing new examples in the sequential cascade style, we combine all the node classifiers into an ensemble classifier and propose a cascade-based classification algorithm, BalanceCascade, to deal with class imbalance problems. In particular, BalanceCascade employs AdaBoost to train the classifier in each node, which is a weighted combination of several weak learners; the weak learners from all node classifiers are then collected to form the final ensemble without changing their original weights. Experimental results show that the method can effectively improve classification performance on imbalanced data sets, especially when classification performance is heavily affected by class imbalance.
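The node-by-node training procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn's AdaBoostClassifier as the per-node learner, the parameter names (n_nodes, etc.) are invented for the sketch, and for simplicity it averages node-level probabilities at prediction time instead of re-collecting the individual weak learners with their original weights as the paper does.

```python
# Sketch of the BalanceCascade idea: train a sequence of node classifiers,
# shrinking the majority-class pool after each node by discarding the
# negatives that the node already rejects correctly.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def balance_cascade(X, y, n_nodes=3, random_state=0):
    """Train up to n_nodes AdaBoost node classifiers on a 0/1-labeled
    data set, where class 1 is the minority ("positive") class."""
    rng = np.random.default_rng(random_state)
    pos = X[y == 1]                      # minority examples, reused at every node
    neg_pool = X[y == 0]                 # majority pool, shrinks over the nodes
    nodes = []
    for _ in range(n_nodes):
        if len(neg_pool) <= len(pos):    # pool is already balanced; stop early
            break
        # undersample the majority pool down to the minority size
        idx = rng.choice(len(neg_pool), size=len(pos), replace=False)
        X_node = np.vstack([pos, neg_pool[idx]])
        y_node = np.concatenate([np.ones(len(pos)), np.zeros(len(pos))])
        clf = AdaBoostClassifier(n_estimators=10, random_state=random_state)
        clf.fit(X_node, y_node)
        nodes.append(clf)
        # keep only the negatives this node misclassifies as positive,
        # so later nodes face the harder "false positive" examples
        neg_pool = neg_pool[clf.predict(neg_pool) == 1]
    return nodes

def predict_ensemble(nodes, X):
    """Combine all node classifiers into one ensemble by averaging their
    positive-class probabilities (a simplification of the weak-learner-
    level combination used in the paper)."""
    probs = np.mean([c.predict_proba(X)[:, 1] for c in nodes], axis=0)
    return (probs >= 0.5).astype(int)
```

Note the contrast with the sequential cascade test of face detection: here every node votes on every example, which is the ensemble-style prediction the abstract argues for in general class-imbalance problems.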
Source: Journal of Nanjing University (Natural Science), CAS / CSCD / PKU Core indexed, 2006, No. 2, pp. 148-155 (8 pages)
Funding: National Science Fund for Distinguished Young Scholars (60325207); Jiangsu Provincial Natural Science Foundation Key Project (BK2004001); National "973" Program (2002CB312002)
Keywords: machine learning, data mining, class imbalance, cascade, ensemble learning
