Abstract
With the rapid development of the Internet, microblogs have become a social media network carrying large volumes of information and complex data, which makes detecting online public opinion a major challenge. This paper improves a parallel K-means partition clustering algorithm based on MapReduce. To address the difficulty of choosing initial cluster centers in K-means, the K value obtained by the Iterative Self-Organizing Data Analysis (ISODATA) algorithm is used to set the initial cluster centers of K-means, which improves clustering accuracy. Finally, the improved K-means algorithm is applied to microblog hot topic detection. Comparison with the traditional K-means algorithm shows that the improved algorithm effectively raises clustering accuracy and has a clear advantage when processing massive data.
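The abstract does not give the implementation, so the following is only a minimal single-machine sketch of how a K-means iteration can be split into a map step (assign each point to its nearest center) and a reduce step (average the points of each cluster). The function names `map_phase`, `reduce_phase`, and `kmeans_mapreduce` are hypothetical, and `initial_centers` is assumed to be supplied from outside, standing in for whatever the ISODATA step produces; ISODATA itself is not implemented here, and no real MapReduce framework is used.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centers):
    """Map: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce: replace each center by the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its position
    for idx, pts in groups.items():
        new_centers[idx] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return new_centers


def kmeans_mapreduce(points, initial_centers, max_iter=100, tol=1e-9):
    """Repeat map + reduce until the centers stop moving (or max_iter is hit).

    initial_centers is an assumption of this sketch: it stands in for the
    output of an ISODATA-style initialization, which is not shown here.
    """
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_mapreduce(pts, [(1, 1), (9, 9)]))
```

In a distributed setting the map step would run in parallel over data partitions and the reduce step would aggregate per-cluster sums and counts; the sketch keeps both in one process only to show the division of work.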
Authors
WANG Lin (王林); XU Jun-meng (许郡蒙)
College of Automation, Xi'an University of Technology, Xi'an, Shaanxi 710048, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal
2020, Issue 8, pp. 121-125 (5 pages)
Funding
Key Research and Development Program of the Shaanxi Provincial Department of Science and Technology (2017ZDCXL-GY-05-03).
Keywords
Partition clustering
Hot topic
Parallelization
Improved partition clustering algorithm