Abstract
Source code similarity measurement underpins many software engineering tasks, including code recommendation, defect monitoring, and code search. Traditional methods rely mainly on statistical techniques that measure similarity through the structural and textual properties of code, and they pay little attention to semantic similarity. To address this problem, a vector space model combining TF-IDF and Word2vec is proposed on the basis of word embedding. The model uses the distance between vectors to measure the similarity between code fragments, integrating both the semantic and the statistical information of the code. Experimental results show that the model improves on traditional statistics-based methods by 15%.
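The abstract does not say how the TF-IDF weights and the Word2vec vectors are combined. One common reading, sketched below purely as an assumption, is to represent each code fragment as the TF-IDF-weighted average of its token embeddings and to compare fragments by cosine similarity. The embedding table, the token lists, and the function names (`idf_table`, `code_vector`, `cosine_similarity`) are hypothetical placeholders; in practice the vectors would come from a trained Word2vec model rather than a hand-written dictionary.

```python
import math
from collections import Counter

# Hypothetical toy embeddings standing in for trained Word2vec vectors.
EMBEDDINGS = {
    "read": [0.9, 0.1, 0.0],
    "load": [0.8, 0.2, 0.1],
    "file": [0.1, 0.9, 0.0],
    "path": [0.2, 0.8, 0.1],
    "sort": [0.0, 0.1, 0.9],
    "list": [0.1, 0.2, 0.8],
}


def idf_table(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def code_vector(tokens, idf):
    """TF-IDF-weighted average of the token embeddings (assumed combination)."""
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    total_weight = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in EMBEDDINGS or tok not in idf:
            continue  # skip tokens without an embedding or an IDF value
        weight = (count / len(tokens)) * idf[tok]
        for i, x in enumerate(EMBEDDINGS[tok]):
            vec[i] += weight * x
        total_weight += weight
    return [v / total_weight for v in vec] if total_weight else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three tokenized "code fragments": the first two share an intent, the third does not.
corpus = [
    ["read", "file", "path"],
    ["load", "file", "path"],
    ["sort", "list"],
]
idf = idf_table(corpus)
vec_a, vec_b, vec_c = (code_vector(doc, idf) for doc in corpus)

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"similar fragments:   {sim_ab:.3f}")
print(f"unrelated fragments: {sim_ac:.3f}")
```

In this sketch, "read" and "load" never co-occur, so a purely statistical token-overlap measure would treat them as unrelated; the embeddings place them close together, which is the kind of semantic signal the abstract says the model adds to the statistical weighting.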
Authors
QIAN Cheng (钱程)
XIE Chun-li (谢春丽)
WANG Meng-qi (王梦琦)
QUAN Lei (权雷)
Affiliations: School of Wisdom Education, Jiangsu Normal University, Xuzhou 221116, China; Department of Computer Science & Technology, Jiangsu Normal University, Xuzhou 221116, China
Source
Software Guide (《软件导刊》)
2021, No. 7, pp. 97-101 (5 pages)
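Funding
National Natural Science Foundation of China (61502212)
Jiangsu Province Higher Education Institutions Undergraduate Innovation and Entrepreneurship Training Program (201910320134Y)
Google-Supported Ministry of Education Industry-University Cooperative Education Project, First Batch of 2019 (2e317703-2af0-4ecb-ba7c-35e290356017)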