Abstract
Software fault data typically contain relatively few positive (faulty) samples, and most samples are hard to label, yet the unlabeled samples carry a great deal of information useful for building a fault detection model. This paper proposes a semi-supervised learning method that builds a classification model from only positive and unlabeled data to detect faults during software development. The method first applies the synthetic minority oversampling technique (SMOTE) to oversample the positive samples and balance the class distribution of the dataset. It then constructs a positive set and an unlabeled set from the balanced data and uses the POSC 4.5 and Bagging algorithms to build a decision tree ensemble classifier for software faults. Comparative experiments on 12 datasets from the NASA MDP database show that a model built from only positive and unlabeled data achieves a fault detection rate close to that of supervised learning, that the ensemble classifier achieves a higher detection rate than a single classifier, and that the size of the unlabeled sample set also affects the fault detection rate.
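The first two steps named in the abstract, SMOTE oversampling of the positive class and the construction of a positive set and an unlabeled set, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `labeled_fraction` parameter, and the rule for hiding labels are assumptions introduced here, and the POSC 4.5 and Bagging stages are not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours (basic SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def split_positive_unlabeled(positives, negatives, labeled_fraction, seed=0):
    """Hypothetical partition rule: keep a fraction of the positives as the
    labeled set P; the remaining positives and all negatives lose their
    labels and form the unlabeled set U."""
    rng = random.Random(seed)
    shuffled = positives[:]
    rng.shuffle(shuffled)
    n_labeled = max(1, int(len(shuffled) * labeled_fraction))
    return shuffled[:n_labeled], shuffled[n_labeled:] + negatives


# Toy data: three faulty modules and six clean ones, two metrics each.
faulty = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
clean = [[5.0, 5.0], [6.0, 5.5], [5.5, 6.0], [6.5, 6.5], [7.0, 5.0], [5.0, 7.0]]
balanced_faulty = faulty + smote(faulty, n_new=3, k=2)
P, U = split_positive_unlabeled(balanced_faulty, clean, labeled_fraction=0.5)
```

Changing `labeled_fraction` changes the size of the unlabeled set, which is the factor the abstract reports as influencing the detection rate.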
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2015, No. 11, pp. 3324-3327, 3331 (5 pages)
Funding
National Natural Science Foundation of China (Grant No. 61303125)