Abstract
Code cloning can speed up software development, but it also causes recurring defects and software quality problems. Some types of code clones have low textual similarity, which makes them hard to detect. To address this problem, this paper proposes a code clone detection method based on the program vector tree. First, feature representations of lexical units are extracted with a statistical language model, and the semantic similarity between different literal words is analyzed. Second, the abstract syntax tree (AST) of each program is extracted by syntactic analysis, and each leaf node is assigned the feature representation of its corresponding literal word, which transforms the AST into a program vector tree. Finally, a weighted encoding rule is proposed that encodes each program vector tree into a fixed-length vector while distinguishing the importance of different tree nodes; code fragments with similar vector representations are reported as code clones. Experimental results on BigCloneBench, a large-scale benchmark of real code clones, show that when detecting Moderately Type-3 and Type-4 clones, which have low textual similarity, the method outperforms current mainstream methods including NiCad, Deckard, SourcererCC, and Oreo, which confirms its effectiveness.
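The three steps in the abstract (token vectors, AST to program vector tree, weighted encoding into a fixed-length vector) can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: Python's `ast` module stands in for the parser, `token_vector` replaces the learned language-model embeddings with deterministic hash-based vectors, and the subtree-size weights and the 0.5/0.5 mixing in `encode` are hypothetical, since the abstract does not specify the weighting rule. The names `DIM`, `node_label`, `encode_source`, and `cosine` are likewise invented here.

```python
import ast
import hashlib
import math

DIM = 16  # assumed vector size; the abstract does not give one


def token_vector(token):
    """Placeholder for a learned token embedding: a deterministic unit vector."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def node_label(node):
    """Use the literal word for leaf-like nodes, the node type otherwise."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return type(node).__name__


def encode(node):
    """Return (vector, subtree_size) for the subtree rooted at node."""
    # Load/Store context markers carry no lexical content, so skip them.
    children = [c for c in ast.iter_child_nodes(node)
                if not isinstance(c, ast.expr_context)]
    own = token_vector(node_label(node))
    if not children:
        return own, 1
    encoded = [encode(c) for c in children]
    total = sum(size for _, size in encoded)
    # Hypothetical weighting: larger subtrees contribute more to the parent.
    child_avg = [sum(vec[i] * size for vec, size in encoded) / total
                 for i in range(DIM)]
    combined = [0.5 * a + 0.5 * b for a, b in zip(own, child_avg)]
    return combined, 1 + total


def encode_source(source):
    """Encode a code fragment as a fixed-length vector."""
    vector, _ = encode(ast.parse(source))
    return vector


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)
```

Two fragments would then be compared with `cosine(encode_source(a), encode_source(b))` and reported as a clone pair when the score exceeds a chosen threshold. With hash-based vectors, renamed identifiers look unrelated; a learned embedding, as the abstract describes, is what would let semantically similar tokens receive similar vectors.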
Authors
ZENG Jie (曾杰)
BEN Kerong (贲可荣)
ZHANG Xian (张献)
LI Xiaowei (李晓伟)
ZHOU Quan (周全)
Affiliations: College of Electronic Engineering, Navy University of Engineering, Wuhan 430033, China; Jinghang Research Institute of Computing and Communication, Beijing 100074, China; School of Computer Science, Wuhan University, Wuhan 430072, China
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2020, Issue 10, pp. 1656-1669 (14 pages)