Abstract
Traditional TF-IDF cannot be updated incrementally when extracting text features, and the traditional Single-Pass algorithm has low clustering accuracy. To address these problems, this paper builds an IDF table from an existing corpus and keeps updating it, which reduces the dependence of the TF-IDF computation on the corpus, and computes cluster centers by averaging to improve the clustering accuracy of the Single-Pass algorithm. The model is validated on COVID-19 news data collected from major platforms. The results show that the method allows traditional TF-IDF keyword extraction to be updated incrementally, and that the improved Single-Pass algorithm raises the comprehensive evaluation index by 8.64%. Compared with the traditional Single-Pass algorithm, the improved algorithm compares each text with only a subset of candidate clusters, which reduces the number of comparisons and improves both the accuracy and the efficiency of clustering.
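The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `IncrementalIdf` and `SinglePassMean`, the smoothed IDF formula, the cosine similarity measure, and the threshold value of 0.3 are all assumptions chosen for illustration. The abstract does not say how candidate clusters are selected, so this sketch simply compares each text against every existing cluster.

```python
import math
from collections import Counter


class IncrementalIdf:
    """IDF table that can be seeded from a corpus and updated per document."""

    def __init__(self):
        self.doc_count = 0          # number of documents seen so far
        self.doc_freq = Counter()   # term -> number of documents containing it

    def update(self, tokens):
        # Each document contributes at most 1 to a term's document frequency.
        self.doc_count += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Smoothed IDF (an assumption; the paper's exact formula is not given).
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1.0

    def tfidf(self, tokens):
        counts = Counter(tokens)
        total = len(tokens)
        return {t: (c / total) * self.idf(t) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SinglePassMean:
    """Single-Pass clustering whose cluster center is the mean of its members."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # hypothetical similarity threshold
        self.sums = []              # per-cluster sum of member vectors
        self.sizes = []             # per-cluster member count

    def centroid(self, k):
        n = self.sizes[k]
        return {t: w / n for t, w in self.sums[k].items()}

    def add(self, vec):
        # Find the most similar existing cluster center.
        best, best_sim = None, -1.0
        for k in range(len(self.sums)):
            sim = cosine(vec, self.centroid(k))
            if sim > best_sim:
                best, best_sim = k, sim
        # Open a new cluster when nothing is similar enough.
        if best is None or best_sim < self.threshold:
            self.sums.append(dict(vec))
            self.sizes.append(1)
            return len(self.sums) - 1
        # Otherwise merge into the best cluster; the mean follows from sum / size.
        for t, w in vec.items():
            self.sums[best][t] = self.sums[best].get(t, 0.0) + w
        self.sizes[best] += 1
        return best
```

In use, one would call `update` on each document of the existing corpus to seed the table, then for every incoming text call `update`, vectorize it with `tfidf`, and pass the vector to `add` to obtain a cluster label. Storing the running sum and count per cluster keeps the mean center cheap to maintain as texts arrive.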
Authors
魏艺泽
郭慧
时晓旭
WEI Yize; GUO Hui; SHI Xiaoxu (School of Computer Science, North China Institute of Science and Technology, Yanjiao 065201, China)
Source
《华北科技学院学报》
2024, No. 1, pp. 76-81, 124 (7 pages in total)
Journal of North China Institute of Science and Technology
Funding
Science and Technology Innovation 2030 Major Project (2021ZD0114203)
National Social Science Fund of China Project (21BSH072).