Abstract
[Objective] Existing topic detection methods make insufficient use of the intrinsic structure of the data. This paper proposes a topic detection method for online news based on shared nearest neighbors and Markov clustering. [Methods] The association strength between news items is characterized by jointly considering the number and the ranks of their shared nearest neighbors, and a shared nearest neighbor graph is built from these strengths so that the intrinsic structure of the data is better exploited. Topics are then detected using dimension reduction, a decision procedure for the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality. [Results] On two online news data sets, the proposed method achieved higher adjusted Rand index (ARI) values, 0.86 and 0.97 respectively, whereas the compared topic detection methods, including LDA, K-Means, and GMM, all scored below 0.75 and 0.90 on the respective data sets. [Limitations] The method has not yet been validated on data sets from other domains or on multilingual data sets. [Conclusions] The proposed method effectively improves topic detection performance for online news and provides a useful reference for research on key topic detection techniques.
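The graph construction and clustering steps named in the Methods part of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rank-based edge weight, the self-loop value, the inflation parameter, the iteration count, and the function names `knn_lists`, `snn_graph`, `normalize_columns`, and `markov_cluster` are all assumptions made for illustration, and dimension reduction, selection of the number of topics, and topic description are left out.

```python
import math

def knn_lists(points, k):
    """For each point, return the indices of its k nearest neighbors, closest first."""
    lists = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        lists.append(others[:k])
    return lists

def snn_graph(points, k):
    """Build a symmetric shared-nearest-neighbor weight matrix.

    Assumed weighting (not taken from the paper): a shared neighbor v adds
    (k - rank of v in i's list) * (k - rank of v in j's list), so neighbors
    ranked high in both lists count more than low-ranked ones.
    """
    nn = knn_lists(points, k)
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        rank_i = {v: r for r, v in enumerate(nn[i])}
        for j in range(i + 1, n):
            s = sum((k - rank_i[v]) * (k - r)
                    for r, v in enumerate(nn[j]) if v in rank_i)
            w[i][j] = w[j][i] = float(s)
    return w

def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_cluster(w, inflation=2.0, iterations=50):
    """Plain Markov clustering: alternate expansion and inflation."""
    n = len(w)
    # Self-loops (assumed weight 1.0) keep every column sum positive.
    m = [[w[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][t] * m[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then renormalize the columns.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Assign each node to the row (attractor) holding most of its column mass.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())

# Toy stand-in for news vectors: two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
weights = snn_graph(points, k=3)
topics = markov_cluster(weights)
print(topics)
```

In this toy input, points from different groups share no nearest neighbors, so their edge weight is zero and no cluster can span both groups. Whether each group stays in one piece depends on the assumed weighting and on the inflation value.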
Authors
吴振峰
兰天
王猛猛
浦墨
张昱
刘志辉
何彦青
Wu Zhenfeng; Lan Tian; Wang Mengmeng; Pu Mo; Zhang Yu; Liu Zhihui; He Yanqing (Institute of Scientific and Technical Information of China, Beijing 100038, China; School of Economics, Renmin University of China, Beijing 100872, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core (Peking University Core Journals)
2022, No. 10, pp. 103-113 (11 pages)
Data Analysis and Knowledge Discovery
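Funding
This work is one of the research outcomes of the National Key R&D Program of China (Grant No. 2019YFA0707201) and the Key Work Project of the Institute of Scientific and Technical Information of China (Grant Nos. ZD2021-17, ZD2022-01).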
Keywords
Shared Nearest Neighbour
Markov Clustering
Online News
Topic Detection