
NIBoost: new imbalanced dataset classification method based on cost-sensitive ensemble learning

Cited by: 11
Abstract: A large amount of imbalanced data exists in real life, and most traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so minority-class samples are frequently misclassified. To address this problem, a new classification algorithm for imbalanced datasets based on cost-sensitive ensemble learning, NIBoost (New Imbalanced Boost), was proposed. First, in each iteration an oversampling algorithm adds a certain number of minority-class samples to balance the dataset, and a classifier is trained on this new dataset. Second, this classifier is used to classify the dataset, yielding the predicted class label of each sample and the classification error rate of the classifier. Finally, the weight coefficient of the classifier and the new weight of each sample are calculated from the classification error rate and the predicted class labels. Experiments on UCI datasets used decision trees and Naive Bayes as weak classifiers. The results show that, with a decision tree as the base classifier, NIBoost improves on the RareBoost algorithm by up to 5.91 percentage points in F-value, up to 7.44 percentage points in G-mean, and up to 4.38 percentage points in AUC, indicating that the proposed algorithm has advantages for imbalanced data classification.
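The three steps in the abstract describe a boosting loop in which the training set is rebalanced before each round. The following is a minimal sketch of that loop, not the authors' implementation: random duplication of minority samples stands in for the paper's oversampling algorithm, and a decision stump stands in for its weak classifiers (the paper uses decision trees and Naive Bayes); the weight update is the standard AdaBoost rule.

```python
import numpy as np

def fit_stump(X, y):
    """Weak learner: a one-level decision stump for binary labels {0, 1},
    chosen to minimize 0/1 error on its training set."""
    best = (0, 0.0, 1, np.inf)                      # (feature, threshold, polarity, error)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            pred = (X[:, f] > thr).astype(int)
            for pol, p in ((1, pred), (-1, 1 - pred)):
                e = np.mean(p != y)
                if e < best[3]:
                    best = (f, thr, pol, e)
    return best[:3]

def stump_predict(stump, X):
    f, thr, pol = stump
    pred = (X[:, f] > thr).astype(int)
    return pred if pol == 1 else 1 - pred

def niboost_like_fit(X, y, n_rounds=10, seed=0):
    """Boosting loop following the three steps in the abstract (sketch only)."""
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1.0 / len(y))               # weights over the original samples
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Step 1: balance the dataset by adding duplicated minority samples,
        # then train the weak classifier on the balanced set.
        idx_min = np.flatnonzero(y == minority)
        idx_maj = np.flatnonzero(y != minority)
        extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
        idx_bal = np.concatenate([idx_maj, idx_min, extra])
        stump = fit_stump(X[idx_bal], y[idx_bal])
        # Step 2: classify the original dataset; weighted classification error rate.
        pred = stump_predict(stump, X)
        err = float(np.sum(w[pred != y]))
        if err == 0.0:                              # perfect round: keep it and stop
            learners.append(stump); alphas.append(1.0)
            break
        if err >= 0.5:                              # no better than chance: stop
            break
        # Step 3: classifier weight and new sample weights from the error rate
        # and predicted labels (AdaBoost-style update).
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(alpha * np.where(pred == y, -1.0, 1.0))
        w = w / w.sum()
        learners.append(stump); alphas.append(alpha)
    return learners, alphas

def niboost_like_predict(learners, alphas, X):
    """Weighted vote of the learners; returns labels in {0, 1}."""
    score = np.zeros(len(X))
    for stump, a in zip(learners, alphas):
        score += a * np.where(stump_predict(stump, X) == 1, 1.0, -1.0)
    return (score >= 0).astype(int)
```

Because the error rate is still measured on the original weighted samples, the ensemble stays focused on the hard examples even though each weak learner trains on an artificially balanced set.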
Authors: WANG Li (王莉), CHEN Hongmei (陈红梅), WANG Shengwu (王生武) (School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China)
Source: Journal of Computer Applications (《计算机应用》; CSCD; Peking University core journal), 2019, No. 3, pp. 629-633 (5 pages)
Funding: National Natural Science Foundation of China (61572406)
Keywords: imbalanced dataset; classification; cost-sensitive; over-sampling; AdaBoost algorithm

