Research on Hot Topic Detection Based on an Incremental Text Clustering Algorithm
Abstract Traditional TF-IDF feature extraction cannot be updated incrementally, and the traditional Single-Pass algorithm has low clustering accuracy. To address these problems, this paper builds an IDF table from an existing corpus and updates it as new text arrives, which reduces the dependence of the TF-IDF calculation on the full corpus, and computes each cluster center as the mean of its members to improve the accuracy of Single-Pass clustering. The model is validated on COVID-19 news data collected from several major platforms. The results show that the method allows traditional TF-IDF keyword extraction to be updated incrementally, and that the improved Single-Pass algorithm raises the comprehensive evaluation index by 8.64%. Compared with the traditional Single-Pass algorithm, the improved algorithm only needs to compare each text with a subset of candidate clusters, which effectively reduces the number of comparisons and improves both the accuracy and the efficiency of clustering.
Authors WEI Yize (魏艺泽); GUO Hui (郭慧); SHI Xiaoxu (时晓旭) (School of Computer Science, North China Institute of Science and Technology, Yanjiao 065201, China)
Source Journal of North China Institute of Science and Technology (《华北科技学院学报》), 2024, No. 1, pp. 76-81, 124 (7 pages in total)
Funding Science and Technology Innovation 2030 Major Project (2021ZD0114203); National Social Science Fund of China project (21BSH072).
Keywords Single-Pass; text clustering; text similarity; hot topic detection; TF-IDF
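
The two ideas in the abstract, an IDF table that is seeded from an existing corpus and then updated, and Single-Pass clustering with mean-valued cluster centers compared only against candidate clusters, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class and function names, the smoothed IDF formula, the similarity threshold of 0.3, and the candidate rule (a cluster is a candidate only if its center shares at least one term with the new text) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


class IncrementalIdf:
    """IDF table seeded from an existing corpus and updated as texts arrive.

    Hypothetical helper: the abstract does not give the exact update rule.
    """

    def __init__(self, seed_docs):
        self.n_docs = 0
        self.doc_freq = Counter()
        for tokens in seed_docs:
            self.update(tokens)

    def update(self, tokens):
        # Each term counts once per document.
        self.n_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Smoothed IDF (an assumed formula), so unseen terms do not divide by zero.
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0

    def tfidf(self, tokens):
        counts = Counter(tokens)
        total = sum(counts.values())
        return {t: (c / total) * self.idf(t) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, idf_table, threshold=0.3):
    """Single-Pass clustering with cluster centers kept as running means."""
    clusters = []  # each cluster: {"center": dict, "size": int, "members": [int]}
    for i, tokens in enumerate(docs):
        idf_table.update(tokens)  # incremental update, no corpus re-scan
        vec = idf_table.tfidf(tokens)
        # Assumed candidate rule: skip clusters with no term in common.
        candidates = [c for c in clusters if vec.keys() & c["center"].keys()]
        best = max(candidates, key=lambda c: cosine(vec, c["center"]), default=None)
        if best is not None and cosine(vec, best["center"]) >= threshold:
            n = best["size"]
            terms = vec.keys() | best["center"].keys()
            # New center = mean of all member vectors, updated in place.
            best["center"] = {
                t: (best["center"].get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
                for t in terms
            }
            best["size"] = n + 1
            best["members"].append(i)
        else:
            clusters.append({"center": vec, "size": 1, "members": [i]})
    return clusters
```

Input texts are assumed to be tokenized already (for Chinese news, by a word segmenter). A call such as `single_pass(docs, IncrementalIdf(seed_docs))` returns the clusters in creation order, each with the indices of its member texts.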