Abstract
Self-admitted technical debt (SATD) refers to technical compromises that developers deliberately make in pursuit of a project's short-term benefits. Prior work has shown that classifiers built from source code comments can be used to detect SATD. However, most existing classification approaches do not consider the class imbalance caused by the small proportion of SATD among code comments, and the few that do fall short of satisfactory results. This paper proposes a cross oversampling approach: the SATD data are first split into a pool of short texts, and new SATD samples are then generated by randomly selecting short texts from the pool and concatenating them. This effectively expands the SATD data and resolves the class imbalance in the text data. In addition, a vector space model is used to construct the feature space, and information gain is used as the feature selection method for training multiple classifiers to identify SATD. Experimental results show that the proposed approach generally outperforms previous methods on three performance measures (precision, recall and F1-score) and can help project staff identify SATD effectively.
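The cross oversampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how comments are split or how many short texts are joined, so splitting on punctuation, the `fragments_per_sample` parameter, and the function names `build_short_text_pool` and `cross_oversample` are all assumptions made for illustration.

```python
import random
import re


def build_short_text_pool(satd_comments):
    """Split each SATD comment into short fragments and pool them together."""
    pool = []
    for comment in satd_comments:
        # Assumption: fragments are delimited by punctuation; the splitting
        # rule actually used in the paper is not given in the abstract.
        fragments = re.split(r"[.,;:!?]+", comment)
        pool.extend(f.strip() for f in fragments if f.strip())
    return pool


def cross_oversample(satd_comments, n_new, fragments_per_sample=2, seed=0):
    """Generate n_new synthetic SATD comments by joining random pool fragments."""
    rng = random.Random(seed)
    pool = build_short_text_pool(satd_comments)
    if not pool:
        return []
    # Hypothetical parameter: how many short texts form one synthetic sample.
    k = min(fragments_per_sample, len(pool))
    # rng.sample draws k distinct pool entries for each synthetic comment.
    return [" ".join(rng.sample(pool, k)) for _ in range(n_new)]


satd = [
    "TODO: this is a hack, fix it later",
    "FIXME: temporary workaround; remove after refactoring",
]
synthetic = cross_oversample(satd, n_new=3)
```

In a setup like this, the synthetic comments would be added only to the minority (SATD) class of the training set, leaving the test set untouched so that the reported precision, recall and F1-score reflect real comments.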
Authors
黄城
徐克辉
郑尚
于化龙
HUANG Cheng; XU Kehui; ZHENG Shang; YU Hualong (School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China; China Ship Research and Development Academy, Beijing 100101, China)
Source
《江苏科技大学学报(自然科学版)》
CAS
2020, Issue 5, pp. 51-56 (6 pages)
Journal of Jiangsu University of Science and Technology:Natural Science Edition
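Funding
National Natural Science Foundation of China (Grant Nos. 61305058, 61572242)
Natural Science Foundation of Jiangsu Province (Grant No. BK20130471)
China Postdoctoral Science Foundation Special Funding Program (Grant No. 2015T80481)
China Postdoctoral Science Foundation (Grant No. 2013M540404)
Jiangsu Province Postdoctoral Foundation (Grant No. 1401037B)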
Keywords
self-admitted technical debt (自承认技术债)
class imbalance (类别不平衡)
cross oversampling (交叉过采样)
feature selection (特征选择)