Abstract
To address the high cost of data annotation and the lack of interpretability of deep learning models in software defect prediction, an interpretability-oriented active learning approach for software defect prediction is proposed. First, based on active learning, a sample selection strategy picks high-uncertainty samples from the target project for expert annotation, and the annotated samples are added to the source project to train the predictor. Second, the selected samples are perturbed using domain knowledge to construct a local dataset, and a linear model is fitted on this dataset to simulate the behavior of the data selection strategy, which makes the model interpretable. Experimental results show that the approach outperforms traditional active learning baseline methods on the data annotation metrics. In terms of interpretability, its RMSE is also lower than that of LIME, the global surrogate model, and RuleFit, so it explains the "black-box" model well. The approach can therefore both improve the annotation efficiency of software defect data and provide model interpretability.
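The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary entropy as the uncertainty measure, substitutes Gaussian noise for the paper's domain-knowledge perturbation, and uses hypothetical names throughout (`select_for_annotation`, `fit_local_surrogate`, `toy_predictor`).

```python
import math
import random


def uncertainty(p):
    """Binary entropy of a predicted defect probability (largest at p = 0.5)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_annotation(pool, predict_proba, k):
    """Return indices of the k pool samples the predictor is least sure about."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: uncertainty(predict_proba(pool[i])),
        reverse=True,
    )
    return ranked[:k]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_local_surrogate(sample, score_fn, n_perturb=200, scale=0.1, seed=0):
    """Perturb `sample`, score the neighbours, fit a linear model.

    Returns (weights, rmse); weights[0] is the intercept.
    Gaussian noise is an assumption standing in for domain-knowledge perturbation.
    """
    rng = random.Random(seed)
    rows = [[1.0] + [v + rng.gauss(0.0, scale) for v in sample]
            for _ in range(n_perturb)]
    targets = [score_fn(r[1:]) for r in rows]
    d = len(sample) + 1
    # Normal equations (X^T X) w = X^T y, with a tiny ridge term for stability.
    xtx = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    w = solve(xtx, xty)
    preds = [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / n_perturb)
    return w, rmse


def toy_predictor(x):
    """Hypothetical stand-in for the trained defect predictor."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.0 * x[1])))


pool = [[0.1, 0.2], [3.0, 0.0], [0.5, 1.0], [-2.0, 1.0]]
picked = select_for_annotation(pool, toy_predictor, k=2)
weights, rmse = fit_local_surrogate(
    pool[picked[0]], lambda x: uncertainty(toy_predictor(x))
)
print(picked, weights, rmse)
```

Here the surrogate is fitted to the selection score rather than to the predictor's output, since the abstract says the linear model simulates the data selection strategy. The RMSE between the surrogate and that score on the local dataset is one plausible reading of the fidelity metric the abstract reports.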
Authors
WANG Yue; LI Yong; ZHANG Wenjing (College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China; Key Laboratory of Safety-Critical Software of Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Source
Modern Electronics Technique (《现代电子技术》)
Peking University Core Journal (北大核心)
2024, Issue 20, pp. 101-108 (8 pages)
Funding
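Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A225)
Key Research and Development Program of Xinjiang Uygur Autonomous Region (2022B01007-1)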
Keywords
software defect prediction
active learning
interpretability
data annotation
data selection strategy
deep learning