
The Impact Study of Class Imbalance on the Performance of Software Defect Prediction Models (cited by: 28)
Abstract: Class imbalance refers to an uneven distribution of sample counts across classes. In software defect prediction, the performance of traditional prediction models may suffer when the dataset is class-imbalanced. To investigate how strongly class imbalance affects the performance of software defect prediction models, this paper proposes an approach for analyzing the impact of class imbalance. First, a new dataset construction algorithm is designed that converts an original imbalanced dataset into a set of new datasets whose imbalance ratios increase step by step. Then, different classification models are selected as defect prediction models and applied to each constructed dataset, with the AUC metric used to measure their classification performance. Finally, the coefficient of variation (C·V) is used to evaluate how stable each prediction model's performance remains under class imbalance. Experiments on eight typical prediction models show that the performance of three of them, C4.5, RIPPER, and SMO, declines as the imbalance ratio increases, while cost-sensitive learning and ensemble learning can effectively improve both their performance and their performance stability under class imbalance. Compared with these three models, models such as Logistic Regression, Naive Bayes, and Random Forest perform more stably.
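The two mechanical steps named in the abstract, building datasets with rising imbalance ratios and scoring stability with C·V, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not give the construction procedure, so `build_imbalanced_series` (a hypothetical name) simply keeps every non-defective sample and drops a growing share of defective ones. The imbalance ratio is assumed here to be the count of non-defective samples divided by the count of defective ones, and C·V is assumed to be the population standard deviation divided by the mean, expressed as a percentage.

```python
import random
import statistics


def imbalance_ratio(labels):
    """Assumed definition: non-defective (0) count divided by defective (1) count."""
    defective = sum(1 for y in labels if y == 1)
    return (len(labels) - defective) / defective


def build_imbalanced_series(samples, labels, steps, seed=0):
    """Hypothetical construction, not the paper's algorithm.

    Keeps all non-defective samples and retains a shrinking subset of the
    defective samples, so each successive dataset has a higher imbalance ratio.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(minority)
    series = []
    for k in range(steps):
        # Retain (steps - k) / steps of the defective samples, at least one.
        keep = max(1, len(minority) * (steps - k) // steps)
        chosen = majority + minority[:keep]
        series.append(([samples[i] for i in chosen], [labels[i] for i in chosen]))
    return series


def coefficient_of_variation(values):
    """C.V as a percentage: population standard deviation over the mean (assumed form)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100.0
```

In use, one would train each prediction model on every dataset in the series, collect one AUC value per dataset, and pass that list of AUC values to `coefficient_of_variation`; a lower value indicates performance that varies less as the imbalance ratio grows.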
Authors: YU Qiao (于巧); JIANG Shu-Juan (姜淑娟); ZHANG Yan-Mei (张艳梅); WANG Xing-Ya (王兴亚); GAO Peng-Fei (高鹏飞); QIAN Jun-Yan (钱俊彦). Affiliations: School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023
Source: Chinese Journal of Computers (计算机学报), indexed in EI, CSCD, and the Peking University Core Journals list; 2018, No. 4, pp. 809-824 (16 pages)
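Funding: Supported by the National Natural Science Foundation of China (61673384, 61502497, 61562015); a research project of the Guangxi Key Laboratory of Trusted Software (kx201530); an open project of the State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2014B19); the Jiangsu Province Graduate Research and Innovation Program for Regular Universities (KYLX15_1443); and the National Undergraduate Innovation Program (201510290001)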
Keywords: class imbalance; software defect prediction; prediction models; imbalance ratio; cost-sensitive learning; ensemble learning
