Solution for classification imbalance in harmfulness prediction of clone code

Abstract: To address the imbalance between harmful and harmless samples in clone code harmfulness prediction, a K-Balance algorithm based on Random Under-Sampling (RUS) is proposed that adjusts the class imbalance automatically. First, static features and evolution features are extracted from clone code to build a sample data set. Then, new data sets with different class imbalance ratios are selected from it. Next, harmfulness prediction is performed on each selected data set. Finally, the most suitable imbalance ratio is chosen automatically by observing how the classifier performs across the ratios. The harmfulness prediction model was evaluated on 170 versions of seven open-source software systems written in C and compared with other class imbalance solutions. The experimental results show that the proposed method improves the classification of harmful and harmless clones, measured by the Area Under the Receiver Operating Characteristic Curve (AUC), by 2.62 to 36.70 percentage points, and thus effectively alleviates the class imbalance problem in prediction.
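
The abstract does not spell out the details of K-Balance, so the following is only a minimal sketch of the general idea it describes: under-sample the majority class at several candidate ratios, train a classifier on each reduced set, and keep the ratio with the best AUC. All names here (`random_undersample`, `auc`, `select_ratio`, `fit_centroid`) are hypothetical, label 1 is assumed to mark the minority class, and the one-dimensional centroid scorer is a toy stand-in for the paper's classifier, not the authors' model.

```python
import random


def random_undersample(samples, labels, ratio, rng):
    """Keep every minority sample (label 1, an assumption) and a random
    subset of majority samples (label 0) of size about ratio * n_minority."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(round(ratio * len(minority))))
    chosen = minority + rng.sample(majority, keep)
    rng.shuffle(chosen)
    return [samples[i] for i in chosen], [labels[i] for i in chosen]


def auc(scores, labels):
    """AUC as the probability that a random positive scores higher than
    a random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_ratio(train_x, train_y, valid_x, valid_y, ratios, fit, seed=0):
    """Try each candidate majority-to-minority ratio and return the
    (ratio, auc) pair with the highest validation AUC."""
    rng = random.Random(seed)
    best_ratio, best_auc = None, -1.0
    for ratio in ratios:
        xs, ys = random_undersample(train_x, train_y, ratio, rng)
        score_fn = fit(xs, ys)  # fit returns a function: sample -> score
        value = auc([score_fn(x) for x in valid_x], valid_y)
        if value > best_auc:
            best_ratio, best_auc = ratio, value
    return best_ratio, best_auc


def fit_centroid(xs, ys):
    """Toy 1-D scorer: higher score means closer to the positive centroid."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: abs(x - mean_neg) - abs(x - mean_pos)


if __name__ == "__main__":
    # Synthetic data for illustration only: 3 minority and 13 majority samples.
    train_x = [8.0, 9.0, 10.0] + [0.5 * i for i in range(13)]
    train_y = [1, 1, 1] + [0] * 13
    valid_x, valid_y = [9.5, 1.0, 2.0, 3.0], [1, 0, 0, 0]
    print(select_ratio(train_x, train_y, valid_x, valid_y, [1, 2, 3], fit_centroid))
```

In practice the candidate ratios, the validation split, and the classifier would come from the experimental setup; the sketch only shows how a ratio search can wrap random under-sampling.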

Authors: WANG Huan (王欢), ZHANG Liping (张丽萍), YAN Sheng (闫盛)

Affiliation: College of Computer and Information Engineering, Inner Mongolia Normal University (内蒙古师范大学计算机与信息工程学院)

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 12, pp. 3468-3475 (8 pages)

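Funding: National Natural Science Foundation of China (61363017, 61462071); Natural Science Foundation of Inner Mongolia (2015MS0606); Scientific Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region (NJZY16045)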

Keywords: code clone; harmfulness; imbalanced classification; random undersampling; parameter search

Classification code: TP311.5 [Automation and Computer Technology: Computer Software and Theory]
