Applying Statistical Sampling-Based Imbalanced Classification in Software Defect Prediction

Abstract: Current research on software defect prediction (SDP) focuses on two aspects: the sources of historical data and the prediction methods. However, the historical defect data obtained are often class-imbalanced, and traditional prediction methods yield high misclassification rates on the defect data. To address this problem, we propose using imbalanced classification methods based on statistical sampling to predict software defects. An empirical comparison of the prediction performance of 12 combinations of existing sampling and classification algorithms shows that combining Spread Subsampling with random forest (SP-RF) gives the best overall performance, but with a relatively high false positive rate (FPR). To further improve prediction performance, and to address the shortcomings of the original SP-RF method, which introduces considerable noise and information loss into the original data, we propose IBSBA-RF, an SP-RF-based adaptive random forest algorithm with built-in balanced sampling. Experiments show that IBSBA-RF significantly reduces the FPR of the predictions and further improves their AUC and Balance values.
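The abstract names the sampling step and the evaluation measures but does not define them. The sketch below is only an illustration under stated assumptions: Spread Subsampling is taken to mean random undersampling that caps each class at a fixed multiple of the rarest class size, and Balance is taken to be the normalized distance from the ideal point (false positive rate 0, recall 1) that is common in the defect prediction literature. The function names `spread_subsample` and `fpr_and_balance` are hypothetical, the random forest step is omitted, and this is not the IBSBA-RF algorithm.

```python
import math
import random
from collections import defaultdict


def spread_subsample(rows, labels, max_spread=1.0, seed=0):
    """Randomly undersample so that no class exceeds max_spread times
    the size of the rarest class (assumed meaning of Spread Subsampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Largest number of instances any single class may keep.
    cap = int(max_spread * min(len(members) for members in by_class.values()))
    sampled_rows, sampled_labels = [], []
    for label, members in by_class.items():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        sampled_rows.extend(kept)
        sampled_labels.extend([label] * len(kept))
    return sampled_rows, sampled_labels


def fpr_and_balance(y_true, y_pred, positive=1):
    """Return (FPR, Balance), where Balance is assumed to be
    1 - sqrt(FPR^2 + (1 - recall)^2) / sqrt(2)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    balance = 1.0 - math.sqrt(fpr ** 2 + (1.0 - recall) ** 2) / math.sqrt(2.0)
    return fpr, balance


if __name__ == "__main__":
    # Toy data: 10 defective modules (label 1) and 90 clean ones (label 0).
    rows = list(range(100))
    labels = [1] * 10 + [0] * 90
    balanced_rows, balanced_labels = spread_subsample(rows, labels)
    print(len(balanced_rows), balanced_labels.count(1), balanced_labels.count(0))
    print(fpr_and_balance([1, 1, 0, 0], [1, 0, 1, 0]))
```

In an SP-RF-style pipeline, the balanced sample would then be used to train a random forest, and the two measures would be computed on held-out predictions. Undersampling discards majority-class instances, which is the information loss the abstract attributes to the original SP-RF method.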

Authors: Xu Kexin (徐可欣), Zhang Wen (张文), Wang Yongji (王永吉)

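Affiliations: National Engineering Research Center of Fundamental Software, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences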

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2015, No. 8, pp. 215-219, 233 (6 pages)

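Funding: National Natural Science Foundation of China (71101138, 61379046, 91318301); Beijing Natural Science Foundation (4122087); National Science and Technology Major Project (2012ZX01039-004)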

Keywords: Software defect prediction; Imbalance; Sampling; Random forest; Cost-sensitive

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]
