Abstract
Ensemble learning, owing to its strong generalization ability, is widely applied in class-imbalanced settings such as information retrieval, image processing, and biology. To improve classification performance on imbalanced data, this paper proposes an ensemble learning algorithm based on balanced sampling and feature selection, namely Oversampling Based on Probability Distribution with Embedded Feature Selection in Boosting (OBPD-EFSBoost). The algorithm consists of three steps. First, guided by a probability model estimated from a Gaussian mixture distribution of the minority class, the original data are oversampled to construct a balanced dataset, enlarging the potential decision region of the minority class. Second, when training a base classifier in each round, the algorithm jointly weights samples and features according to the previous round's misclassified samples, thereby filtering out redundant and noisy features. Finally, the ensemble classifier is obtained by weighted voting over the base classifiers. Classification results on eight UCI datasets show that the algorithm not only improves the classification accuracy of the minority class, but also remedies the sensitivity of Boosting-type algorithms to noise features, exhibiting strong robustness.
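The three steps summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' published implementation: it simplifies the Gaussian mixture to a single Gaussian, uses decision stumps as base classifiers, and scores features by weighted correlation with the labels as a stand-in for the paper's embedded feature-selection rule. All function names and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_minority(X_min, n_new):
    """Step 1 (simplified): fit a single Gaussian to the minority class
    and draw synthetic samples from it. The paper fits a Gaussian
    mixture; one component keeps this sketch dependency-free."""
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mu, cov, size=n_new)

def fit_stump(X, y, w):
    """Weighted decision stump (labels y in {-1, +1}): best
    single-feature threshold under sample weights w."""
    best = (0, 0.0, 1, np.inf)                 # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) > 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def obpd_efsboost(X, y, rounds=10, keep_ratio=0.8):
    """Steps 2-3: boosting loop. Each round scores features by weighted
    label correlation (an illustrative proxy for the paper's embedded
    feature selection), keeps the top fraction, fits a stump, and
    reweights misclassified samples as in AdaBoost."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        scores = np.abs((X * w[:, None]).T @ y)
        keep = np.sort(np.argsort(scores)[-max(1, int(d * keep_ratio)):])
        j, t, pol, err = fit_stump(X[:, keep], y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, keep[j]] - t) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)         # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, keep[j], t, pol))
    return ensemble

def predict(ensemble, X):
    """Final classifier: weighted vote over the base stumps."""
    agg = sum(a * np.where(pol * (X[:, j] - t) > 0, 1, -1)
              for a, j, t, pol in ensemble)
    return np.where(agg > 0, 1, -1)
```

On a toy imbalanced problem, the minority class is first inflated with synthetic Gaussian samples to match the majority count, after which the boosting loop proceeds on the balanced set.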
Authors
曹雅茜
黄海燕
CAO Ya-xi; HUANG Hai-yan (Key Laboratory of Advanced Process Control and Optimization for Chemical Processes, East China University of Science and Technology, Ministry of Education, Shanghai 200237, China)
Source
《计算机科学》
CSCD
Peking University Core Journals
2019, Issue 5, pp. 203-208 (6 pages)
Computer Science
Keywords
Imbalanced data classification
Ensemble learning
Feature selection
Probability distribution