Research on Parallelization Based on an Improved Canopy-K-means Algorithm (cited by: 10)
Abstract: With the rapid growth of Internet data, the original K-means algorithm can no longer meet the clustering needs of large-scale data. To address this, an improved Canopy-K-means clustering algorithm is proposed. First, to overcome the random selection of center points in the Canopy algorithm, the "maximum and minimum principle" is introduced to optimize the selection of Canopy centers. Next, the K-means algorithm is optimized using the triangle inequality, which removes redundant distance calculations and speeds up convergence. Finally, the improved Canopy-K-means algorithm is parallelized on the MapReduce framework. The optimized algorithm is tested on a purpose-built Weibo dataset. The results show that, across Weibo datasets of different sizes, the accuracy of the optimized algorithm is about 15% higher than that of K-means and about 7% higher than that of the original Canopy-K-means algorithm, and that its execution efficiency and scalability are also considerably improved.
Authors: Wang Lin, Jia Junchen (School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048, China)
Source: Computer Measurement & Control (《计算机测量与控制》), 2021, No. 2, pp. 176-179, 186 (5 pages in total)
Funding: Key Project of the Shaanxi Provincial Science and Technology Plan (2017ZDCXL-GY-05-03).
Keywords: Canopy-K-means algorithm; text clustering; maximum and minimum principle; triangle inequality; MapReduce
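
The two optimizations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no pseudocode, so the function names `max_min_centers` and `assign_with_pruning`, the choice of the first point as the initial seed, and the use of Euclidean distance are all assumptions made for illustration. The first function picks each new center as the point whose distance to its nearest already-chosen center is largest (a common reading of the maximum and minimum principle). The second skips a point-to-center distance whenever the triangle inequality already shows that center cannot be closer.

```python
import math


def max_min_centers(points, k):
    """Pick k seed centers by the max-min distance rule (hypothetical helper).

    Assumption: the first seed is simply points[0]; the abstract does not
    say how the paper chooses it.
    """
    centers = [points[0]]
    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [
            min(d, math.dist(p, points[idx])) for d, p in zip(min_dist, points)
        ]
    return centers


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center (hypothetical helper).

    Triangle-inequality pruning: if d(c_best, c_j) >= 2 * d(x, c_best),
    then d(x, c_j) >= d(x, c_best), so d(x, c_j) need not be computed.
    Returns the labels and the number of distance computations skipped.
    """
    # Center-to-center distances are computed once per assignment pass.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for x in points:
        best, best_d = 0, math.dist(x, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * best_d:
                skipped += 1  # center j cannot beat the current best
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped
```

Calling `max_min_centers(points, k)` and then `assign_with_pruning(points, centers)` gives the same labels as a brute-force nearest-center search, because a center is skipped only when it provably cannot be strictly closer. In a MapReduce setting such as the one the abstract describes, an assignment step of this kind would typically run in the map phase and the center update in the reduce phase, though the abstract does not spell out that split.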