Abstract
This paper presents a semantic clustering approach that uses the unstructured natural-language information in source code to support program comprehension. First, natural language processing techniques pre-process the identifiers and comments in a program and convert the program into an intermediate representation, a term-document (term-frequency) matrix. Then, Latent Semantic Indexing (LSI) is applied to this matrix to obtain a set of hierarchical clusters, and label words are recommended for each cluster to make it easier to understand. The approach was evaluated on the open-source project JEdit; clustering this 50,000-LOC project took less than 1 minute. The technique can therefore cluster a program semantically in a short time and help developers understand it quickly.
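The first step described above, turning identifiers and comments into a term-document matrix, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `split_terms` and `term_document_matrix`, the stop-word list, and the camelCase splitting rule are all hypothetical, and no stemming is applied. The LSI and hierarchical clustering steps are not shown.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not state which words it removes.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "this"}


def split_terms(text):
    """Split identifiers and comment text into lowercase word terms."""
    terms = []
    # Take alphabetic runs (this also splits snake_case at underscores),
    # then break each run at camelCase boundaries.
    for token in re.findall(r"[A-Za-z]+", text):
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token):
            word = part.lower()
            if len(word) > 1 and word not in STOP_WORDS:
                terms.append(word)
    return terms


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    counts = [Counter(split_terms(text)) for text in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Two toy "documents" (e.g. one per source file), made up for illustration.
docs = [
    "// Opens the file buffer\nvoid openFileBuffer(String filePath)",
    "/* Parse XML */ XMLParser parse_xml_node()",
]
vocabulary, matrix = term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Each row of the resulting matrix is one term and each column one document; a matrix of this shape is what LSI would then factorize before the documents are clustered.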
Author
CHEN Ying (陈颖), School of Information Engineering, Yangzhou University, Yangzhou 225127, China
Source
Software Guide (《软件导刊》)
2019, No. 10, pp. 62-64 (3 pages)
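Funding
Jiangsu Province Education Informatization Research Fund Project (20180104)
Open Fund Project of the Civil Aviation of China Information Technology Research Base (CAAC-ITRB-201704)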
Keywords
program comprehension
semantic clustering
latent semantic indexing
semantic labelling