Using boosting tree to learn imbalanced data


Abstract Class imbalance, in which one class has far more samples than the others, is a persistent problem in machine learning. It biases the classifier toward the majority class, which degrades performance on the minority class. We propose an improved boosting tree (BT) algorithm for learning imbalanced data, called cost BT. In each iteration of cost BT, only the weights of the misclassified minority class samples are increased. In addition, the error rate in the weight formula of the base classifier is replaced by 1 minus the F-measure. The performance of cost BT is compared with other known methods on 9 public data sets. The compared methods are the decision tree and random forest algorithms, each combined with sampling techniques such as the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, the adaptive synthetic sampling approach (ADASYN), and one-sided selection. Cost BT performed better than the compared methods in F-measure, G-mean, and area under the curve (AUC). On 6 of the 9 data sets, cost BT outperformed the other published methods. Increasing only the weights of the misclassified minority class samples in each BT iteration raises the proportion of the minority class in the whole sample, which improves the prediction performance of the base classifiers. Computing the weights of the base classifiers with the F-measure also helps the ensemble decision.
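The two modifications described in the abstract can be illustrated with a minimal sketch of a single boosting round. This is not the authors' implementation: it assumes an AdaBoost-style classifier weight, 0.5 * ln((1 - err) / err), with err replaced by 1 minus the F-measure, and an exponential re-weighting followed by normalization. The function names (`f_measure`, `classifier_weight`, `update_weights`) and the toy labels are hypothetical.

```python
import math


def f_measure(y_true, y_pred, minority=1):
    """F-measure (F1) with the minority class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def classifier_weight(f, eps=1e-10):
    """Base classifier weight with the error rate replaced by 1 - F.

    Assumption: the AdaBoost formula 0.5 * ln((1 - err) / err) is used;
    the abstract does not spell out the exact formula.
    """
    err = min(max(1.0 - f, eps), 1.0 - eps)  # clip to avoid log(0) and division by zero
    return 0.5 * math.log((1.0 - err) / err)


def update_weights(weights, y_true, y_pred, alpha, minority=1):
    """Scale only the misclassified minority samples, then normalize.

    Assumption: the scale factor is exp(alpha), as in AdaBoost. It is greater
    than 1 (an increase) only when alpha > 0, that is, when F > 0.5.
    """
    scaled = [
        w * math.exp(alpha) if (t == minority and p != minority) else w
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy round: 3 minority samples (label 1) and 5 majority samples (label 0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # one minority and one majority sample misclassified
weights = [1 / len(y_true)] * len(y_true)

f = f_measure(y_true, y_pred)
alpha = classifier_weight(f)
new_weights = update_weights(weights, y_true, y_pred, alpha)
```

In this toy round the misclassified minority sample (index 2) ends up with a larger weight than the misclassified majority sample (index 7), whose weight changes only through normalization. That asymmetry is the difference from standard AdaBoost, where every misclassified sample is up-weighted.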

Authors Yang Ridong, Zhang Shiyu, Li Lin, Wang Zhe, Zhou Yi

Affiliations Zhongshan School of Medicine; College of Public Health

Source The Journal of China Universities of Posts and Telecommunications (中国邮电高校学报（英文版）), indexed in EI and CSCD, 2019, No. 2, pp. 43-51, 81 (10 pages in total)

Funding Supported by the National Key Research and Development Program of China (2018YFC0116902, 2016YFC0901602), the National Natural Science Foundation of China (NSFC) (61876194), the Joint Foundation for the NSFC and Guangdong Science Center for Big Data (U1611261), the Medical Scientific Research Foundation of Guangdong Province of China (C2017037), and the Science and Technology Program of Guangzhou (201604020016).

Keywords machine learning; class imbalanced; BT; data sampling

Classification code TN [Electronics and Telecommunications]

