Abstract
Most existing clone group mapping approaches compare only adjacent versions, so mapping across multiple versions becomes difficult when a clone group temporarily disappears in intermediate versions. To address this, the paper proposes a multi-version clone group mapping method based on LDA and DBSCAN. First, the clone groups of all versions are preprocessed to obtain a collection of clone group documents. Second, a suitable number of topics T is selected according to the Bayesian information criterion, a topic probability model is trained, and every clone group is represented as a probability distribution vector over the T topics. Third, the JS distance between clone groups is computed, and the DBSCAN algorithm is used to gather clone groups of common origin into one cluster. Finally, the clone groups in each cluster are sorted by version order to obtain the multi-version clone group mapping result. Mapping experiments on 83 versions of five open-source software systems show that both recall and precision exceed 98%, which provides strong support for the analysis and management of clone code.
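The last two steps of the method (JS distance, then DBSCAN clustering and sorting by version) can be illustrated with a minimal sketch. It is not the authors' implementation. The topic vectors are hypothetical stand-ins for the output of a trained LDA model, the names `js_distance`, `dbscan`, `map_clone_groups` and the values `eps=0.1`, `min_pts=2` are illustrative assumptions, and "JS distance" is taken here to mean the square root of the base-2 Jensen-Shannon divergence, which the abstract does not specify.

```python
import math

def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Each neighborhood includes the point itself, as in standard DBSCAN.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

# Hypothetical input: (version, clone group id, distribution over T = 3 topics).
clone_groups = [
    (1, "cg_a", [0.90, 0.05, 0.05]),
    (2, "cg_b", [0.88, 0.07, 0.05]),
    (4, "cg_c", [0.91, 0.04, 0.05]),  # no counterpart in version 3
    (1, "cg_d", [0.05, 0.05, 0.90]),
    (3, "cg_e", [0.06, 0.04, 0.90]),
]

def map_clone_groups(groups, eps=0.1, min_pts=2):
    """Cluster clone groups by topic similarity, then order each cluster by version."""
    labels = dbscan([g[2] for g in groups], eps, min_pts, js_distance)
    chains = {}
    for (version, name, _), label in zip(groups, labels):
        if label != -1:
            chains.setdefault(label, []).append((version, name))
    return [sorted(chain) for chain in chains.values()]

mapping = map_clone_groups(clone_groups)
```

With this toy input, `cg_a`, `cg_b` and `cg_c` end up in one chain even though no matching clone group exists in version 3, which is the situation that adjacent-version comparison handles poorly. Because clustering considers all versions at once, a gap in the middle does not break the chain.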
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue 2, pp. 481-486 (6 pages)
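Funding
National Natural Science Foundation of China (61363017, 61462071)
Natural Science Foundation of Inner Mongolia (2014MS0613, 2015MS0606)
Scientific Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region (NJZY16045)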