
Imbalanced Data Classification Algorithm Based on Probability Sampling and Ensemble Learning
Abstract: Ensemble learning, owing to its strong generalization ability, is widely applied in class-imbalanced scenarios such as information retrieval, image processing, and biology. To improve classification performance on imbalanced data, this paper proposes an ensemble learning algorithm named Oversampling Based on Probability Distribution with Embedded Feature Selection in Boosting (OBPD-EFSBoost). The algorithm consists of three steps. First, based on a probability model obtained from a Gaussian mixture distribution of the minority class, oversampling is performed to construct a balanced dataset, enlarging the potential decision region of the minority class. Second, when training each base classifier, the samples misclassified in the previous round are used to jointly weight both samples and features, filtering out redundant noise features. Finally, the ensemble classifier is obtained through weighted voting over the base classifiers. Classification results on eight UCI datasets show that the algorithm not only effectively improves classification accuracy for the minority class, but also remedies the sensitivity of Boosting-style algorithms to noise features, exhibiting strong robustness.
Authors: 曹雅茜 (CAO Ya-xi), 黄海燕 (HUANG Hai-yan) — Key Laboratory of Advanced Process Control and Optimization for Chemical Processes (East China University of Science and Technology), Ministry of Education, Shanghai 200237, China
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2019, No. 5, pp. 203-208 (6 pages)
Keywords: Imbalanced data classification; Ensemble learning; Feature selection; Probability distribution
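The three steps described in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' OBPD-EFSBoost implementation: it substitutes a simple weighted-correlation feature score for the paper's embedded feature selection, and all function names and parameter choices are hypothetical.

```python
# Sketch of the abstract's three steps: (1) GMM-based oversampling of the
# minority class, (2) boosting with per-round feature filtering driven by
# sample weights, (3) weighted voting. Illustrative only, not the paper's code.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

def gmm_oversample(X_min, n_new, n_components=2, seed=0):
    """Step 1: fit a Gaussian mixture to the minority class and draw
    synthetic samples from it to balance the dataset."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_min)
    X_new, _ = gmm.sample(n_new)
    return X_new

def boost_with_feature_filter(X, y, n_rounds=10, keep_ratio=0.8, seed=0):
    """Steps 2-3: AdaBoost-style loop (labels y in {0, 1}); each round keeps
    the highest-scoring features (a stand-in for the paper's embedded feature
    selection) and the final prediction is a weighted vote of base classifiers."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                       # sample weights
    models, alphas, feats = [], [], []
    for _ in range(n_rounds):
        # Score features by weighted correlation with the labels; drop the rest.
        scores = np.abs((X * w[:, None]).T @ (2 * y - 1))
        keep = np.argsort(scores)[-max(1, int(d * keep_ratio)):]
        clf = DecisionTreeClassifier(max_depth=2, random_state=seed)
        clf.fit(X[:, keep], y, sample_weight=w)
        pred = clf.predict(X[:, keep])
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # base-classifier weight
        # Increase weight on misclassified samples for the next round.
        w *= np.exp(-alpha * (2 * y - 1) * (2 * pred - 1))
        w /= w.sum()
        models.append(clf); alphas.append(alpha); feats.append(keep)

    def predict(Xq):
        # Step 3: weighted vote over all base classifiers.
        votes = sum(a * (2 * m.predict(Xq[:, f]) - 1)
                    for m, a, f in zip(models, alphas, feats))
        return (votes > 0).astype(int)
    return predict
```

In use, one would oversample the minority class first, then train the boosted ensemble on the balanced set; the paper additionally folds the noise-feature weighting into each boosting round, which this sketch only approximates.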


