
A Weibo topic discovery algorithm based on CWMD and SP
Abstract: Traditional Weibo topic discovery algorithms measure text distance using only the minimum cumulative word-to-word distance. To address the problems this causes, the paper proposes an algorithm that uses CWMD (cos-word mover's distance) as the clustering criterion. The algorithm combines cosine distance with WMD to compute the similarity between sentences; replaces the word-frequency weight vector in WMD with a TF-IDF vector, so that the contribution of every word to the document is taken into account; and uses CWMD in place of conventional distances as the criterion for SP (Single-Pass) clustering. It also proposes an SP algorithm that builds a pending pool of texts, which aims to keep the order in which data arrive from affecting the result and thereby improve the accuracy of topic discovery. Comparative experiments on part of a Chinese corpus database confirm that the proposed topic discovery model performs better. The model is then applied to Weibo text data crawled with Python: the keywords of the extracted clusters are compared with Weibo trending topics, and the results show a strong correlation between the two.
Authors: SUN Yue (孙悦); LUO Qian (罗倩); FANG Liangyu (方梁雨). School of Information and Communication Engineering, Beijing Information Science & Technology University, Beijing 100192, China
Source: Journal of Beijing Information Science and Technology University (《北京信息科技大学学报(自然科学版)》, Natural Science Edition), 2021, No. 2, pp. 76-81 (6 pages)
Funding: China Academy of Railway Sciences, Locomotive Running Gear Condition Monitoring System (9151524108).
Keywords: weighted-word2vec; cosine distance; word mover's distance; text clustering (the Chinese keyword list gives "incremental clustering"); topic discovery
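
The abstract describes Single-Pass clustering with a pending pool but gives no algorithmic details, so the following Python is only a rough sketch of how such a scheme might look. Everything specific here is an assumption made for illustration: the function names `cosine_distance` and `single_pass`, the two thresholds `join_below` and `new_above`, and the rule that a text whose nearest-cluster distance falls between the thresholds is deferred until all other texts have been seen. The distance used is plain cosine distance over term counts, a stand-in for CWMD and not the paper's measure; any function with the same signature could be passed in instead.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two term-count Counters (stand-in, NOT CWMD)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def single_pass(docs, distance, join_below=0.4, new_above=0.8):
    """Single-Pass clustering with a pending pool (hypothetical thresholds).

    A text joins its nearest cluster if the distance is below `join_below`,
    seeds a new cluster if it is above `new_above`, and is otherwise parked
    in the pending pool and assigned after the main pass.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    pending = []   # ambiguous texts deferred to the end

    def nearest(vec):
        # Returns (distance, cluster index); (inf, None) when no cluster exists.
        scored = [(distance(vec, c["centroid"]), i) for i, c in enumerate(clusters)]
        return min(scored, default=(float("inf"), None))

    def join(i, idx, vec):
        clusters[i]["members"].append(idx)
        # Summed counts serve as the centroid; cosine ignores the scale.
        clusters[i]["centroid"].update(vec)

    for idx, vec in enumerate(docs):
        d, i = nearest(vec)
        if d < join_below:
            join(i, idx, vec)
        elif d > new_above:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
        else:
            pending.append((idx, vec))

    # Revisit deferred texts once the clusters have absorbed the clear cases.
    for idx, vec in pending:
        d, i = nearest(vec)
        if i is not None and d <= new_above:
            join(i, idx, vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})

    return [sorted(c["members"]) for c in clusters]


# Toy, made-up token lists; text 2 is ambiguous when it arrives.
docs = [Counter(tokens) for tokens in [
    ["train", "delay", "beijing", "station"],
    ["movie", "premiere", "star"],
    ["train", "movie"],
    ["beijing", "train", "delay", "notice"],
    ["star", "movie", "award"],
]]
topics = single_pass(docs, cosine_distance)
print(topics)
```

In this toy run the third text is roughly equidistant from both clusters when it arrives, so it waits in the pool and is assigned only after the later texts have sharpened the two centroids. A first text always seeds a cluster, so this sketch reduces order sensitivity only for ambiguous texts rather than removing it.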