Abstract
Topic modeling-based source code comprehension has recently become an active research area. Such approaches mine functional topics from source code with topic modeling techniques and use them to help developers understand the functional concerns of a software system and their implementation in code. However, the raw topics mined from source code mix functional topics with other kinds of topics (such as cross-cutting topics), so developers have to identify the functional topics manually. Because most existing approaches provide only basic information, such as the words associated with each topic, identifying and applying functional topics is difficult and time-consuming. In this paper, we propose an approach based on topic modeling and static analysis for obtaining functional topics from source code. First, we preprocess the source code with a set of filtering heuristics and mine raw topics from it using topic modeling techniques. Next, using the structural relationships among code elements obtained through static analysis, we propose a metric called Topic Cohesion to automatically identify functional topics among the raw topics. Finally, we locate the code fragments associated with each topic and generate a natural-language description of the topic from the code and its comments, which further helps developers understand the functionality the topic represents and the details of its implementation. An evaluation on a set of open source software projects shows that the approach can effectively obtain functional topics and their associated information, helping developers better comprehend the functional concerns of a software system and their implementation in source code.
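The abstract names Topic Cohesion but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: a topic is cohesive when the code elements associated with it are closely connected by structural relationships (for example, call edges found by static analysis). The functions `topic_cohesion` and `functional_topics`, the pair-density formula, the 0.5 threshold, and the sample element names are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def topic_cohesion(elements, edges):
    """Hypothetical cohesion score: the fraction of element pairs in a topic
    that are directly linked by a structural relationship (ASSUMED formula)."""
    elements = sorted(set(elements))
    if len(elements) < 2:
        return 0.0
    # Treat relationships as undirected for this sketch.
    related = {frozenset(edge) for edge in edges}
    pairs = list(combinations(elements, 2))
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in related)
    return linked / len(pairs)


def functional_topics(topics, edges, threshold=0.5):
    """Keep topics whose cohesion reaches an ASSUMED threshold."""
    return [name for name, elems in topics.items()
            if topic_cohesion(elems, edges) >= threshold]


# Made-up structural relationships, standing in for static analysis output.
edges = [
    ("Parser.parse", "Lexer.next"),
    ("Parser.parse", "Ast.build"),
    ("Lexer.next", "Ast.build"),
]

# Made-up raw topics, each mapped to the code elements it is associated with.
topics = {
    "parsing": ["Parser.parse", "Lexer.next", "Ast.build"],
    "logging": ["Parser.parse", "Db.connect", "Ui.render"],
}

print(functional_topics(topics, edges))
```

In this toy input the elements of "parsing" all reference one another, while those of "logging" are scattered across unrelated modules, which is the kind of contrast a cohesion-style metric is meant to separate.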
Source
《中国科学:信息科学》
CSCD
2014, Issue 1, pp. 54-69 (16 pages)
Scientia Sinica(Informationis)
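Funding
National High Technology Research and Development Program of China (863 Program) (Grant No. 2012AA011202)
National Basic Research Program of China (973 Program) (Grant No. 2011CB302604)
National Natural Science Foundation of China (Grant Nos. 61121063, U1201252)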
Keywords
program comprehension
topic model
static analysis
data mining
information retrieval