Abstract
Imbalanced data are common in real-world applications. Most traditional classification algorithms assume a balanced class distribution or equal misclassification costs across samples, so they tend to misclassify minority-class samples when applied to such data. To address this problem, a new classification algorithm for imbalanced data based on cost-sensitive ensemble learning and oversampling, NIBoost (New Imbalanced Boost), was proposed. Firstly, in each iteration, an oversampling algorithm was used to add a certain number of minority-class samples to balance the dataset, and a classifier was trained on the new dataset. Secondly, the classifier was used to classify the dataset, yielding the predicted class label of each sample and the classification error rate of the classifier. Finally, the weight coefficient of the classifier and the new weight of each sample were calculated from the classification error rate and the predicted class labels. Decision tree and Naive Bayes were used as the weak classifier algorithms. Experimental results on UCI datasets show that, with decision tree as the base classifier, NIBoost improves on the RareBoost algorithm by up to 5.91 percentage points in F-value, up to 7.44 percentage points in G-mean, and up to 4.38 percentage points in AUC. These results indicate that the proposed algorithm has certain advantages in classifying imbalanced data.
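The F-value and G-mean named in the abstract are commonly computed from the binary confusion matrix, with the minority class treated as the positive class. The sketch below uses those common definitions. It is an assumption that the paper uses the same ones (including beta = 1 for the F-value), and the function names and toy labels are illustrative only.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f_value(y_true, y_pred, positive=1, beta=1.0):
    """F-measure of the positive class (beta = 1 is an assumed default)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and true negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy imbalanced labels: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_majority = [0] * 10

print(f_value(y_true, y_pred), g_mean(y_true, y_pred))
print(f_value(y_true, always_majority), g_mean(y_true, always_majority))
```

A classifier that always predicts the majority class reaches 80% accuracy on these toy labels but scores 0 on both measures, which is why such metrics are preferred over accuracy for imbalanced data.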
Authors
王莉
陈红梅
王生武
WANG Li; CHEN Hongmei; WANG Shengwu (School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China)
Source
《计算机应用》
CSCD
Peking University Core Journal (北大核心)
2019, Issue 3, pp. 629-633 (5 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (61572406)