
A Code Search Approach Based on Vector Representation
Abstract Software developers often need to use a large number of base packages written by other developers. To learn how these packages are used beyond what their development documentation covers, developers enter code keywords into a code search engine to find code snippets. This paper proposes a code search method based on vector representation. The method collects code snippets from GitHub and Stack Overflow datasets to train a skip-gram model for expanding code words, then uses this model to expand the search keywords that are extracted from the search text and associated with code words. This yields a group of context code snippet vectors for the search keywords. After this group and the group of candidate code snippet vectors are encoded, cosine similarity is computed and the candidates are ranked to produce the search results. The method was evaluated on both the GitHub dataset and the Stack Overflow dataset. On the Stack Overflow dataset, 58% of searches find the correct answer in the first result, 65% within the first five results, and 72% within the first ten results, with some improvement in recall and F-measure as well. On the GitHub dataset, 59% of searches find the correct answer in the first result, 67% within the first five results, and 74% within the first ten results, again with some improvement in recall and F-measure. For code retrieval over large amounts of data, the proposed algorithm outperforms typical methods.
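The pipeline the abstract describes (expand the query keywords with a word-vector model, encode the query and each candidate snippet, rank by cosine similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the three-dimensional vectors are invented stand-ins for a trained skip-gram model, mean pooling is an assumed encoding that the abstract does not specify, and the names `EMBEDDINGS`, `SNIPPETS`, `expand`, `encode`, and `search` are hypothetical.

```python
import math

# Toy word vectors standing in for a trained skip-gram model.
# Tokens and values are invented for illustration only.
EMBEDDINGS = {
    "read":    [0.9, 0.1, 0.0],
    "open":    [0.8, 0.2, 0.1],
    "file":    [0.7, 0.3, 0.0],
    "socket":  [0.0, 0.9, 0.2],
    "connect": [0.1, 0.8, 0.3],
    "sort":    [0.0, 0.1, 0.9],
    "list":    [0.1, 0.2, 0.8],
}

# Hypothetical candidate snippets, each reduced to its code words.
SNIPPETS = {
    "read_file": ["open", "file", "read"],
    "tcp_client": ["socket", "connect"],
    "sort_items": ["sort", "list"],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand(keywords, k=1):
    """Append the k nearest vocabulary words for each known keyword."""
    expanded = list(keywords)
    for word in keywords:
        if word not in EMBEDDINGS:
            continue
        neighbours = sorted(
            (w for w in EMBEDDINGS if w not in expanded),
            key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
            reverse=True,
        )
        expanded.extend(neighbours[:k])
    return expanded


def encode(tokens):
    """Mean of the known token vectors (an assumed encoding)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def search(keywords, snippets, k=1):
    """Rank snippet names by cosine similarity to the expanded query."""
    query_vector = encode(expand(keywords, k))
    scored = [
        (cosine(query_vector, encode(tokens)), name)
        for name, tokens in snippets.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(expand(["read"]))
    print(search(["read"], SNIPPETS))
```

In this toy setup the single keyword `read` is expanded with its nearest neighbour before encoding, so a snippet can rank highly even when it shares no literal token with the original query. A real system would replace `EMBEDDINGS` with vectors learned from the collected code snippets.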
Authors MU Jianglin (慕江林); LIU Kejian (刘克剑); LIN Han (林晗) (School of Computer and Software Engineering, Xihua University, Chengdu 610039, China; College of Management Science, Chengdu University of Technology, Chengdu 610059, China)
Source Journal of Xihua University: Natural Science Edition (《西华大学学报(自然科学版)》), CAS, 2019, No. 5, pp. 106-112 (7 pages)
Keywords code vector representation; code search; semantic coding; cosine similarity