
Software defect prediction based on imbalanced datasets (cited by: 7)
Abstract: To address the problem of data imbalance, this paper proposes a sampling method that combines over-sampling, in which new samples are synthesized from fitted distribution functions, with random under-sampling. The method first reduces the dimensionality of the original dataset, then fits distribution functions to the principal components and draws random values from the fitted distributions. The generated values are filtered by truncation and by removing noise samples, and the over-sampled data are then combined with random under-sampling to form the training and test sets. The experiments use the NASA MDP datasets and compare the method with the classic SMOTE + random under-sampling approach. Judged by G-mean and F-measure, the proposed method predicts clearly better than SMOTE + random under-sampling.
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2017, No. 7, pp. 2027-2031 (5 pages)
Funding: National Defense Key Project Fund (JCKY2016206B001); National Defense General Project Fund (JCKY2014206C002)
Keywords: software failure prediction; imbalanced datasets; principal component analysis; classification and regression tree
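
The sampling procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the input rows are taken to be principal-component scores already, each component is fitted with a normal distribution, the filter is a simple truncation to the observed range of the minority class, and the two classes are balanced at the midpoint of their sizes. All function names are hypothetical. The last function computes G-mean and F-measure from their standard definitions, with the minority (defective) class labeled 1.

```python
import math
import random
from statistics import NormalDist


def oversample_by_distribution(minority, n_new, rng):
    """Draw n_new synthetic minority rows from per-column normal fits.

    Assumption: rows are already principal-component scores, so the
    columns are treated as independent. The normal family and the
    range-based truncation filter are illustrative choices only.
    """
    columns = list(zip(*minority))
    fits = [NormalDist.from_samples(col) for col in columns]
    bounds = [(min(col), max(col)) for col in columns]
    synthetic = []
    while len(synthetic) < n_new:
        row = [rng.gauss(fit.mean, fit.stdev) for fit in fits]
        # Truncation: discard rows outside the observed minority range.
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds)):
            synthetic.append(row)
    return synthetic


def random_undersample(majority, n_keep, rng):
    """Keep n_keep majority rows chosen uniformly without replacement."""
    return rng.sample(majority, n_keep)


def balance(majority, minority, rng):
    """Over-sample the minority and under-sample the majority.

    Assumption: both classes are brought to the midpoint of their
    original sizes; the paper's actual target ratio is not given here.
    """
    target = (len(majority) + len(minority)) // 2
    new_rows = oversample_by_distribution(
        minority, target - len(minority), rng
    )
    kept = random_undersample(majority, target, rng)
    return kept, minority + new_rows


def g_mean_and_f_measure(y_true, y_pred):
    """Return (G-mean, F-measure) with class 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return g_mean, f_measure
```

In this sketch the balanced training set would be `kept` plus the enlarged minority list returned by `balance`, which could then be passed to any classifier, such as the classification and regression tree named in the keywords.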
