Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models
Abstract Clustering documents with topic models is common practice in many text mining tasks. Many studies apply topic models to data from software Q&A websites, clustering it to analyze how different fields develop within the community. However, such software-related data often contains code snippets and has an uneven distribution of text lengths, so modeling it with a single traditional topic model tends to produce unstable clustering results. This paper proposes a clustering method that combines code snippets with hybrid topic models. Using Stack Overflow as the data source, it constructs a dataset of the 60 Python third-party libraries with the most questions asked on the platform. After modeling, the dataset is divided into six fields: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, under both automatic and manual evaluation metrics, topic modeling on code snippets combined with text yields good-quality cluster partitions, and that combining multiple models improves the stability and accuracy of the clustering results to a certain extent.
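The abstract does not spell out how code snippets are merged with text or how the models are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes post bodies are HTML with code wrapped in `<pre><code>` blocks, tags code tokens with a hypothetical `code_` prefix so they stay distinct from prose tokens, and uses a co-association score (the fraction of models that put two items in the same cluster) as one simple stand-in for combining several models. All function names are illustrative.

```python
import re
from collections import Counter
from html import unescape
from itertools import combinations

# Assumption: code snippets appear as <pre><code>...</code></pre> in the post body.
CODE_RE = re.compile(r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_post(body_html):
    """Return (prose, code) extracted from an HTML post body."""
    code = "\n".join(unescape(block) for block in CODE_RE.findall(body_html))
    # Remove the code blocks first, then strip the remaining tags from the prose.
    prose = unescape(TAG_RE.sub(" ", CODE_RE.sub(" ", body_html)))
    return prose, code


def build_document(body_html):
    """Build one bag of words that holds both prose tokens and code tokens."""
    prose, code = split_post(body_html)
    text_tokens = [w.lower() for w in WORD_RE.findall(prose)]
    # Hypothetical convention: prefix code identifiers so they form separate terms.
    code_tokens = ["code_" + w.lower() for w in WORD_RE.findall(code)]
    return Counter(text_tokens + code_tokens)


def co_association(labelings):
    """For each pair of items, the fraction of labelings that cluster them together.

    `labelings` is a list of cluster-label lists, one per model, all of equal length.
    """
    n_items = len(labelings[0])
    return {
        (i, j): sum(labels[i] == labels[j] for labels in labelings) / len(labelings)
        for i, j in combinations(range(n_items), 2)
    }
```

The bag of words returned by `build_document` could be fed to any topic model, and pairs with a high co-association score across models can be treated as more stable assignments than those from a single model.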
Authors WEI Linlin; SHEN Guohua; HUANG Zhiqiu; CAI Mengnan; GUO Feifei (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China)
Source Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 6, pp. 44-51 (8 pages)
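Funding National Natural Science Foundation of China (61772270, U2241216); Open Fund of the Key Laboratory of Civil Aviation Emergency Science and Technology (NJ2022022).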
Keywords Code snippets; Topic model; Stack Overflow; Python; Clustering