Abstract
To address class imbalance in the data, this paper proposes a sampling algorithm that combines over-sampling, in which new samples are synthesized from fitted distribution functions, with random under-sampling. The algorithm first reduces the dimensionality of the original dataset, then fits distribution functions to the resulting principal components and uses the fitted distributions to generate random values. The generated values are screened by truncation and by removal of noise samples, and the over-sampled data are then combined with random under-sampling to form the training and test sets. The experiments use datasets from the NASA MDP repository, and the results are compared with the classic SMOTE + random under-sampling method. Judged by G-mean and F-measure, the proposed method clearly outperforms SMOTE + random under-sampling and achieves higher prediction accuracy.
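The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the minority-class rows have already been projected onto principal components, that each component is fitted independently with a normal distribution (the abstract does not name the distribution family), and that truncation means keeping values inside the observed range of each component. The noise-sample removal step is omitted because the abstract does not specify it. The function names are hypothetical, and the G-mean and F-measure formulas are the commonly used definitions, not ones quoted from the paper.

```python
import math
import random
from statistics import NormalDist


def oversample_minority(minority, n_new, seed=0):
    """Draw n_new synthetic rows from per-component normal fits.

    minority: list of rows, each a list of principal-component scores.
    Assumption: components are independent and normally distributed.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))
    fits = [NormalDist.from_samples(col) for col in columns]
    bounds = [(min(col), max(col)) for col in columns]
    synthetic = []
    while len(synthetic) < n_new:
        row = [rng.gauss(fit.mean, fit.stdev) for fit in fits]
        # Truncation (assumed rule): reject rows that leave the observed
        # range of any component. Rejection gets slower as dimensions grow.
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds)):
            synthetic.append(row)
    return synthetic


def undersample_majority(majority, n_keep, seed=0):
    """Randomly keep at most n_keep majority rows, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, min(n_keep, len(majority)))


def g_mean(tp, fn, tn, fp):
    """Geometric mean of recall (TPR) and specificity (TNR)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)


def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, the balanced training set would be the original minority rows plus the output of `oversample_minority`, together with the output of `undersample_majority`; the target class ratio is left to the caller because the abstract does not state one.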
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2017, No. 7, pp. 2027-2031 (5 pages)
Application Research of Computers
Funding
National Defense Key Project Fund (JCKY2016206B001)
National Defense General Project Fund (JCKY2014206C002)
Keywords
software failure prediction
imbalanced datasets
principal component analysis
classification and regression tree (CART)