Abstract
To mitigate the effect of class imbalance on the performance of prediction models, a cluster-based under-sampling ensemble method (CBUE) is proposed. The majority class is first clustered. N subsets of the majority class are then drawn with replacement according to the distribution of the clustering result, that is, the proportion of majority instances falling in each cluster, and each subset is combined with all minority-class instances to form one of N new training sets. N classifiers are trained on these N training sets and combined into an ensemble classifier that predicts new data by majority vote. CBUE is evaluated on the NASA datasets using balance, G-mean, and AUC as evaluation metrics. Experimental results show that, in most cases, the method outperforms five classical baseline methods (ROS, RUS, SMOTE, RF, and NB).
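The procedure described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions the abstract does not confirm: the cluster labels are taken as given (any clustering algorithm could produce them), each majority subset is sized to match the minority class, each cluster's quota is its rounded proportional share, and a simple nearest-centroid rule stands in for the unspecified base classifier. The names `proportional_subset`, `NearestCentroid`, `train_cbue`, and `predict_majority_vote` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def proportional_subset(majority, cluster_ids, size, rng):
    """Draw about `size` majority instances with replacement, giving each
    cluster a quota proportional to its share of the majority class
    (assumed interpretation of the clustering-based selection)."""
    clusters = defaultdict(list)
    for x, c in zip(majority, cluster_ids):
        clusters[c].append(x)
    subset = []
    for members in clusters.values():
        quota = max(1, round(size * len(members) / len(majority)))
        subset.extend(rng.choices(members, k=quota))  # with replacement
    return subset


class NearestCentroid:
    """Placeholder base learner (assumption): predict the class whose
    mean feature vector is closest to the query point."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append(x)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, x):
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lbl: dist2(self.centroids[lbl]))


def train_cbue(majority, cluster_ids, minority, n_models, seed=0):
    """Build n_models training sets (one majority subset plus all minority
    instances each) and fit one base learner per set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = proportional_subset(majority, cluster_ids, len(minority), rng)
        X = subset + minority
        y = [0] * len(subset) + [1] * len(minority)  # 0 = majority, 1 = minority
        models.append(NearestCentroid().fit(X, y))
    return models


def predict_majority_vote(models, x):
    """Combine the base learners' predictions by majority vote."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: two majority clusters and one small minority group.
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.0, 4.0], [0.2, 4.1]]
cluster_ids = [0, 0, 0, 0, 1, 1]
minority = [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

models = train_cbue(majority, cluster_ids, minority, n_models=5)
print(predict_majority_vote(models, [5.1, 5.0]))
```

Using fixed per-cluster quotas, rather than drawing uniformly from the whole majority class, is what lets the clustering influence the sample: every cluster is represented in every subset in proportion to its size. An odd `n_models` avoids tied votes in the two-class case.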
Source
Computer Engineering and Design (《计算机工程与设计》)
PKU Core Journal (北大核心)
2016, No. 7, pp. 1805-1810, 1891 (7 pages)
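Funding
National Natural Science Foundation of China (61202006, 61272424)
Open Project Fund of the State Key Laboratory for Novel Software Technology (KFKT2012B29)
Natural Science Foundation of Jiangsu Province (BK2010277)
Science and Technology Innovation Fund of Jiangsu Province (BC2013167)
Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (12KJB520014)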
Keywords
class imbalance learning
software defect prediction
ensemble learning method
under-sampling
clustering