Research and Implementation of Program Code Similarity Detection Technology (程序代码相似度检测技术的研究与实现)

Abstract: Traditional similarity algorithms achieve low accuracy when used to check assignments in programming courses. To address this problem, the paper studies algorithms such as the longest common subsequence (LCS), identifies their strengths and weaknesses, and proposes a program similarity calculation method that combines attribute counting with structure metrics. The method first preprocesses the source program by deleting comments and spaces, then determines the commonly used elements and commonly used structures. It uses Lex to count and extract program elements, and uses the open-source ucc code to generate a syntax tree from which the corresponding syntactic structures are extracted. Finally, it generates a feature vector and computes the code similarity. Experimental results show that the method is 10.6% more accurate than the LCS algorithm.
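The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. A regular expression stands in for Lex. The element list in `KEYWORDS` and `OPERATORS` is hypothetical. Cosine similarity is an assumed choice, since the abstract does not name the similarity measure. The syntax-tree step performed with ucc is not reproduced. An LCS length function is included only as the baseline the abstract compares against.

```python
import math
import re
from collections import Counter

# Hypothetical element list; the abstract does not enumerate the paper's elements.
KEYWORDS = ("if", "else", "for", "while", "return", "int", "float", "char")
OPERATORS = ("==", "<=", ">=", "!=", "+", "-", "*", "/", "=", "<", ">")

# Regex tokenizer used here as a stand-in for a Lex-generated scanner.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|==|<=|>=|!=|[-+*/=<>]")


def preprocess(source):
    """Strip C-style comments and collapse whitespace.

    Simplification: comment markers inside string literals are not handled.
    """
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    return " ".join(source.split())


def feature_vector(source):
    """Attribute counting: occurrences of each chosen element, in a fixed order."""
    counts = Counter(TOKEN_RE.findall(preprocess(source)))
    return [counts[name] for name in KEYWORDS + OPERATORS]


def cosine_similarity(a, b):
    """Assumed similarity measure between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lcs_length(a, b):
    """Baseline: length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    prog_a = "int main() { int a = 1; /* sum */ return a + 2; }"
    prog_b = "int main(){\n  int x = 1; // renamed\n  return x + 2;\n}"
    print(cosine_similarity(feature_vector(prog_a), feature_vector(prog_b)))
```

In this sketch, renaming a variable or changing comments leaves the count vector unchanged, which is the kind of disguise that attribute counting is meant to tolerate. Statement reordering and structural edits would need the structure-metric half of the method, which is omitted here.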

Authors: Wei Junchao; Geng Nan (College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China)

Affiliation: College of Information Engineering, Northwest A&F University

Source: Information & Computer (《信息与电脑》), 2017, No. 3, pp. 99-101, 107 (4 pages)

Funding: University-level teaching reform project of Xi'an Traffic Engineering Institute (Project No. 150006B)

Keywords: attribute counting method; structure measurement technique; similarity measure

Classification: TP311.11 [Automation and Computer Technology: Computer Software and Theory]
