Abstract
Dataset quality is critical to the performance of software defect prediction models. Traditional datasets suffer from two problems: an excessive number of features, which slows learning, and class imbalance, in which defective samples are far fewer than non-defective ones. To address both, a dataset optimization method based on principal component analysis (PCA) and weighted data augmentation is proposed. PCA reduces the dimensionality of the data, which removes redundant data, shortens model learning time, and improves detection efficiency. Weighted data augmentation increases the proportion of the defective class in the sample set, which improves the recognition rate for defective samples.
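The two steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the weighting scheme, so `augment_minority` assumes that synthetic defective samples are random weighted averages of pairs of existing defective samples. The function names, the `variance_kept` threshold, and the `target_ratio` parameter are likewise hypothetical.

```python
import numpy as np


def pca_reduce(X, variance_kept=0.95):
    """Project X onto the fewest principal components that together
    explain at least `variance_kept` of the total variance."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions of the centered data.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    k = min(k, Vt.shape[0])  # guard against rounding at variance_kept = 1.0
    return Xc @ Vt[:k].T


def augment_minority(X, y, target_ratio=0.5, seed=0):
    """Add synthetic defective samples (label 1) until they make up
    `target_ratio` of the dataset.

    ASSUMPTION: each synthetic sample is a random weighted average of two
    existing defective samples; the paper's actual weighting may differ.
    """
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    n_total, n_min = len(y), len(minority)
    # Smallest n with (n_min + n) / (n_total + n) >= target_ratio.
    n_new = int(np.ceil((target_ratio * n_total - n_min) / (1 - target_ratio)))
    if n_min == 0 or n_new <= 0:
        return X, y
    i = rng.integers(0, n_min, n_new)
    j = rng.integers(0, n_min, n_new)
    w = rng.random((n_new, 1))  # per-sample mixing weight in [0, 1)
    synthetic = w * minority[i] + (1 - w) * minority[j]
    X_aug = np.vstack([X, synthetic])
    y_aug = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_aug, y_aug
```

A typical call would be `Z = pca_reduce(X)` followed by `X_aug, y_aug = augment_minority(Z, y)`, so that augmentation operates in the reduced feature space. The ordering is also an assumption; the abstract lists the two steps without stating how they are combined.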
Author
LI Bing (李冰), Academy of Military Sciences, Beijing 100091, China
Source
Journal of Information Engineering University (《信息工程大学学报》)
2022, No. 1, pp. 87-92 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61272041).
Keywords
software defect dataset
data optimization
principal component analysis