Abstract
In software defect prediction, defect data sets often contain redundant or irrelevant features, so feature selection is required. To avoid the instability of the ranking-based feature selection methods commonly used in software defect prediction, a feature selection method based on sorting integration is proposed. First, three feature selection methods (correlation coefficient, information gain ratio, and ReliefF) are run separately to obtain feature ranking sequences, and each feature is assigned a weight according to each ranking. Next, the weights a feature receives from the three methods are summed to give its total weight. Finally, the features are sorted from high to low by total weight and selected from the top down according to a given percentage of features. In the empirical study, 11 NASA data sets were used as experimental subjects, logistic regression was used to build the prediction models, and the AUC metric was used to measure the classification performance of the different models. The experimental results confirm the effectiveness of the feature selection method based on sorting integration.
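The three steps described in the abstract (rank per method, sum the per-method weights, keep a top percentage) can be illustrated with a minimal Python sketch. The abstract does not say how a rank is converted into a weight, so the rule used here (best feature gets n, worst gets 1) is an assumption, as are the function names `rank_weights` and `ensemble_select` and the made-up scores; this is not the authors' implementation.

```python
import math


def rank_weights(scores):
    """Convert raw scores into rank-based weights.

    Assumed rule (not from the paper): with n features, the best-scored
    feature gets weight n and the worst gets weight 1. Ties are broken
    by feature index because Python's sort is stable.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    weights = [0] * n
    for position, feature in enumerate(order):
        weights[feature] = n - position
    return weights


def ensemble_select(score_lists, percent):
    """Sum rank-based weights across methods and keep the top `percent`% of features."""
    n = len(score_lists[0])
    total = [0] * n
    for scores in score_lists:
        for i, w in enumerate(rank_weights(scores)):
            total[i] += w
    # Sort feature indices by total weight, highest first.
    ranked = sorted(range(n), key=lambda i: total[i], reverse=True)
    k = max(1, math.ceil(n * percent / 100))
    return ranked[:k], total


# Made-up scores for four features from three hypothetical scorers.
correlation = [0.10, 0.80, 0.40, 0.30]
gain_ratio = [0.20, 0.60, 0.70, 0.10]
relief_f = [0.05, 0.90, 0.30, 0.50]

selected, total = ensemble_select([correlation, gain_ratio, relief_f], 50)
print(total)     # total weight per feature
print(selected)  # indices of the selected features
```

Because only ranks enter the sum, the scores of the three methods need not share a scale, which is one plausible reason for combining rankings rather than raw scores.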
Authors
JIANG Li (姜丽)
JIANG Shu-juan (姜淑娟)
YU Qiao (于巧)
School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China; School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2018, No. 7, pp. 1410-1414 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 61673384 and 61502497)
Keywords
software defect prediction
feature selection
feature weight
sorting integration