Improved Canopy-Kmeans algorithm based on MapReduce (cited by: 65)
Abstract: To address the randomness of Canopy selection in the distributed Canopy-Kmeans algorithm, this paper improves the algorithm with the "minimum-maximum principle", which avoids blind Canopy selection. The algorithm is then parallelized on the MapReduce framework so that it can make full use of a cluster's computing and storage capacity and scale to massive data. The improved algorithm is evaluated on the clustering of massive Internet news data. The experimental results show that, compared with random Canopy selection, the method clearly improves classification accuracy and noise immunity, and that it has a considerable performance advantage when processing massive data.
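Author: 毛典辉 (Mao Dianhui)
Source: 《计算机工程与应用》 (Computer Engineering and Applications), CSCD, 2012, No. 27, pp. 22-26, 68 (6 pages)
Funding: National Natural Science Foundation of China (No. 2009ZX05038-001); Beijing Municipal Universities Science, Technology and Graduate Education Innovation Project (No. PXM2012_014213_000037)
Keywords: Canopy-Kmeans algorithm; MapReduce; distributed clustering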
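The abstract names the "minimum-maximum principle" but does not spell out the procedure. One common reading, assumed here, is farthest-first selection: each new center is the point whose distance to its nearest already-chosen center is largest. The sketch below is a single-machine illustration of that reading followed by ordinary k-means refinement. It is not the paper's MapReduce implementation, it omits the Canopy distance thresholds, and the function names `minmax_centers` and `kmeans` are hypothetical.

```python
import math


def minmax_centers(points, k, first=0):
    """Pick up to k initial centers by farthest-first (min-max) selection.

    Assumption: each new center is the point that maximizes its minimum
    distance to the centers chosen so far. `first` is the index of the
    starting point, which is an arbitrary choice in this sketch.
    """
    centers = [points[first]]
    # min_d[i] is the distance from points[i] to its nearest chosen center.
    min_d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=min_d.__getitem__)
        if min_d[idx] == 0.0:
            break  # every remaining point coincides with a chosen center
        centers.append(points[idx])
        min_d = [min(d, math.dist(p, points[idx])) for d, p in zip(min_d, points)]
    return centers


def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers
```

A typical call would be `kmeans(points, minmax_centers(points, k))`, where `points` is a list of equal-length numeric tuples. In a MapReduce setting, the assignment step would correspond to the map phase and the mean computation to the reduce phase, but that mapping is a general pattern rather than a detail confirmed by the abstract.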