
A Cascade-based Classification Method for Class-imbalanced Data (cited by: 23)
Abstract: In real-world problems, the numbers of samples in different classes often differ greatly, and traditional machine learning methods have difficulty classifying minority-class samples correctly; when the minority-class samples are sufficiently important, this causes large losses. Learning from data with an imbalanced class distribution has therefore become a challenge facing machine learning. Inspired by the cascade model used in computer vision, this paper proposes a classification method for imbalanced data, BalanceCascade. The method gradually shrinks the majority class so that the data set becomes more balanced, and the sequence of classifiers trained in this process classifies test samples as an ensemble. Experimental results show that the method effectively improves classification performance on imbalanced data, especially when that performance is severely affected by the imbalance.

In machine learning and data mining, many factors may influence the performance of a learning system in real-world applications. Class imbalance is one of them: training examples in one class heavily outnumber those in another, and classifiers generally have difficulty learning the concept of the minority class. In many applications the minority class is the more important one, so misclassifying it incurs great loss. Severe class imbalance arises in the face detection problem, where it greatly decreases detection speed; the cascade structure was proposed to accelerate the learning process. A cascade is a classifier system with a sequence of n node classifiers. At the beginning, all training examples are available to train the first node classifier. Then all positive examples, but only a subset of the negative examples, are passed to the next node, discarding those negatives correctly classified by the first node. This procedure repeats until all node classifiers are trained. A test example is passed to the next node if the current node recognizes it as positive, or is immediately rejected as negative otherwise. The learning goal of a cascade node classifier is quite different from that of usual classifiers: every node aims for a high detection rate and only a moderate false alarm rate, so the cascade as a whole can achieve both a high overall detection rate and a low overall false alarm rate. Each time training examples are passed to the next node, some negatives are discarded; that is, there are fewer negatives in the training set than in the previous node.

Considering the class imbalance problem, this means a more balanced training set than in previous nodes. In the early nodes of a cascade it is quite easy to achieve the learning goal, i.e., to train a classifier with a high detection rate and only a moderate false alarm rate. It becomes harder in deeper nodes, since the negative examples there are false positives from previous nodes and are difficult to separate from the positive examples. There is another difference between the face detection problem and general class imbalance problems: hundreds of thousands of features are available to the classifiers in the former case, but not in the latter. In general class imbalance problems, a classifier in a deeper node may not easily achieve both a high detection rate and a moderate false alarm rate, so cascade-style testing may not be appropriate. Instead of testing new examples in the sequential cascade style, we combine all the node classifiers into an ensemble classifier and propose a cascade-based classification algorithm, BalanceCascade, to deal with class imbalance problems. In particular, BalanceCascade employs AdaBoost to train the classifier in each node, which is a weighted combination of several weak learners; the weak learners from all node classifiers are then collected to form the final ensemble without changing their original weights. Experimental results show that the method can effectively improve classification performance on imbalanced data sets, especially when classification performance is heavily affected by class imbalance.
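The node-by-node training procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn's AdaBoostClassifier as the per-node learner, the parameter names (n_nodes, etc.) are invented for the sketch, and for simplicity it averages node-level probabilities at prediction time instead of re-collecting the individual weak learners with their original weights as the paper does.

```python
# Sketch of the BalanceCascade idea: train a sequence of node classifiers,
# shrinking the majority-class pool after each node by discarding the
# negatives that the node already rejects correctly.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def balance_cascade(X, y, n_nodes=3, random_state=0):
    """Train up to n_nodes AdaBoost node classifiers on a 0/1-labeled
    data set, where class 1 is the minority ("positive") class."""
    rng = np.random.default_rng(random_state)
    pos = X[y == 1]                      # minority examples, reused at every node
    neg_pool = X[y == 0]                 # majority pool, shrinks over the nodes
    nodes = []
    for _ in range(n_nodes):
        if len(neg_pool) <= len(pos):    # pool is already balanced; stop early
            break
        # undersample the majority pool down to the minority size
        idx = rng.choice(len(neg_pool), size=len(pos), replace=False)
        X_node = np.vstack([pos, neg_pool[idx]])
        y_node = np.concatenate([np.ones(len(pos)), np.zeros(len(pos))])
        clf = AdaBoostClassifier(n_estimators=10, random_state=random_state)
        clf.fit(X_node, y_node)
        nodes.append(clf)
        # keep only the negatives this node misclassifies as positive,
        # so later nodes face the harder "false positive" examples
        neg_pool = neg_pool[clf.predict(neg_pool) == 1]
    return nodes

def predict_ensemble(nodes, X):
    """Combine all node classifiers into one ensemble by averaging their
    positive-class probabilities (a simplification of the weak-learner-
    level combination used in the paper)."""
    probs = np.mean([c.predict_proba(X)[:, 1] for c in nodes], axis=0)
    return (probs >= 0.5).astype(int)
```

Note the contrast with the sequential cascade test of face detection: here every node votes on every example, which is the ensemble-style prediction the abstract argues for in general class-imbalance problems.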
Source: Journal of Nanjing University (Natural Science), CAS / CSCD / PKU Core indexed, 2006, No. 2, pp. 148-155 (8 pages)
Funding: National Science Fund for Distinguished Young Scholars (60325207); Jiangsu Provincial Natural Science Foundation Key Project (BK2004001); National "973" Program (2002CB312002)
Keywords: machine learning, data mining, class imbalance, cascade, ensemble learning
