Abstract
Fast clustering of massive volumes of Internet news into hot topics is an important research direction. To address several key problems in large-scale text clustering (similarity calculation, distributed clustering, and summary generation for clustering results), this paper designs and implements a distributed news clustering system based on the Spark computing framework. The system uses a GPU-accelerated deep similarity algorithm to compute the similarity between news texts and obtain the similarity relations among news items, then applies a graph clustering algorithm to cluster the news, and finally uses title compression to produce a short description of each hot topic, yielding the final clustering result. Experiments show that the proposed system has high execution efficiency and good scalability and can effectively handle hot-topic clustering of large-scale news.
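The three stages named in the abstract (similarity calculation, graph clustering, cluster description) can be illustrated with a small single-machine sketch. It is not the authors' implementation: Jaccard overlap on whitespace tokens stands in for the GPU-accelerated deep similarity model, connected components of the thresholded similarity graph stand in for the unspecified graph clustering algorithm, and picking the shortest title stands in for title compression. The function names `jaccard`, `cluster_news`, and `describe`, and the threshold value, are all hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Token-set overlap; an assumed stand-in for a deep similarity model."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_news(titles, threshold=0.5):
    """Group titles whose pairwise similarity reaches `threshold`.

    Builds a similarity graph and returns its connected components,
    each as a list of indices into `titles`.
    """
    # Assumption: whitespace tokenization (Chinese text would need a segmenter).
    tokens = [set(t.lower().split()) for t in titles]
    parent = list(range(len(titles)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Add an edge (merge two components) for every sufficiently similar pair.
    for i, j in combinations(range(len(titles)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def describe(titles, cluster):
    """Pick the shortest title in a cluster as a crude hot-topic label."""
    return min((titles[i] for i in cluster), key=len)
```

The all-pairs loop is quadratic in the number of news items, which is the cost a distributed, GPU-accelerated design such as the one the abstract describes is meant to address at scale.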
Authors
LU Xian-hua (卢献华)
WANG Hong-jun (王洪俊)
Beijing Information Science and Technology University, Beijing 100101, China; Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2019, Issue S11, pp. 220-223 (4 pages)
Keywords
Distributed graph clustering
Deep similarity calculation
GPU acceleration
Title compression
Big data