Abstract
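[Objective] To extract hot news topics with hot-topic detection techniques and so reduce the reading burden on news readers. [Methods] Building on TF-IDF, keywords are extracted with a position-weighting scheme over equalized paragraphs (WTF-IDF); with K-means as the base method, sub-topic vectors are introduced into hierarchical clustering to cluster topics; high-frequency title words are extracted to describe each topic. [Results] With three extracted keywords, WTF-IDF improves F1 by 5.4% over TF-IDF; hierarchical clustering based on WTF-IDF and sub-topic vectors improves accuracy by 3.1% over K-means clustering with hierarchical TF-IDF. [Limitations] Keyword extraction does not consider phrases, and the hierarchical clustering method increases the algorithm's time complexity. [Conclusions] The proposed keyword extraction and hierarchical clustering methods improve hot news topic detection, and the topic phrases produced by topic description are reasonably representative and readable.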
[Objective] This paper proposes a model for detecting the topics of trending news stories, aiming to reduce the reading burden on news users. [Methods] We modified the TF-IDF method with position weighting over equalized paragraphs (WTF-IDF). We also extended K-means clustering with sub-topic vectors in a hierarchical clustering scheme. Finally, we extracted high-frequency words from titles to describe each topic. [Results] With three extracted keywords, the F1 value of WTF-IDF was 5.4% higher than that of the TF-IDF method. The accuracy of hierarchical clustering based on WTF-IDF and sub-topic vectors was 3.1% higher than that of the single-layer K-means clustering. [Limitations] Keyword extraction does not handle phrases, and the hierarchical clustering method increases time complexity. [Conclusions] The proposed keyword extraction and hierarchical clustering methods improve the detection of trending news topics.
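The position-weighted keyword extraction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual WTF-IDF formula, the paragraph-equalization step, or the weights, so everything here is an assumption made for illustration: the function name `position_weighted_keywords`, the weights `first_weight` and `last_weight`, and the smoothed IDF are hypothetical and are not the authors' method. Documents are assumed to be pre-tokenized into paragraphs.

```python
import math
from collections import Counter


def position_weighted_keywords(docs, first_weight=2.0, last_weight=1.5, top_k=3):
    """Return the top_k keywords per document using a position-weighted TF-IDF.

    docs: list of documents; each document is a list of paragraphs;
          each paragraph is a list of tokens.
    The weights are illustrative assumptions, not values from the paper.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({tok for para in doc for tok in para})

    keywords = []
    for doc in docs:
        # Term counts scaled by the weight of the paragraph they occur in.
        weighted_tf = Counter()
        for i, para in enumerate(doc):
            if i == 0:
                weight = first_weight      # assumed boost for the lead paragraph
            elif i == len(doc) - 1:
                weight = last_weight       # assumed boost for the closing paragraph
            else:
                weight = 1.0
            for tok in para:
                weighted_tf[tok] += weight

        total = sum(weighted_tf.values()) or 1.0
        scores = {
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in weighted_tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


docs = [
    [["flood", "rescue"], ["weather", "rain", "rain"], ["rescue", "teams"]],
    [["market", "stocks"], ["rain", "delay"], ["stocks", "fall"]],
]
print(position_weighted_keywords(docs))
```

In this toy input, "rain" occurs twice in the first document but only in a middle paragraph and also appears in the second document, so it ranks below terms from the lead and closing paragraphs. That is the kind of effect a position weight is meant to produce.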
Authors
Wei Jiaze (魏家泽)
Dong Cheng (董诚)
He Yanqing (何彦青)
Liu Zhihui (刘志辉)
Peng Keyun (彭柯芸)
Institute of Scientific and Technical Information of China, Beijing 100038, China; Science and Technology Bureau of Ganzi Prefecture, Kangding 626000, China
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2020, No. 10, pp. 70-79 (10 pages)
Data Analysis and Knowledge Discovery
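Funding
One of the research outcomes of two key projects of the Institute of Scientific and Technical Information of China: "Research and Application of Key Technologies for Multilingual Scientific and Technical Information Services (Phase II)" (Project No. ZD2019-20) and "Research on Russian-Chinese Cross-Language Knowledge Discovery and Services" (Project No. ZD2020-10).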
Keywords
Equalized Paragraph (均衡段落)
Sub-topic Vector (分话题向量)
Hot Topic Detection (热点话题检测)
Hierarchical Clustering (分层聚类)