
Distributed Affinity Propagation Clustering Based on MapReduce (cited by: 53)
Abstract: With the rapid development of information technology, data volumes are growing explosively, and processing large-scale data is a challenge for traditional machine learning algorithms. Many parallel algorithms have been proposed to address the scalability problem, such as the MapReduce-based distributed K-means algorithm and the distributed spectral clustering algorithm. Affinity propagation (AP) clustering overcomes some limitations of K-means, but it performs poorly on massive data. To cluster massive data efficiently, this paper proposes DisAP, a distributed AP clustering algorithm based on MapReduce. DisAP first partitions the data points at random into subsets of similar size and sparsifies each subset in parallel with AP clustering. The sparsified subsets are then fused and clustered again with AP, and the resulting exemplars serve as the cluster centers for all data points. Finally, all data points are assigned to these exemplars in parallel. DisAP is implemented on a Hadoop cluster. Experiments on synthetic data, human face image data, the IRIS dataset, and large-scale datasets show that DisAP scales well with data size and effectively reduces clustering time while preserving the clustering quality of AP.
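The four stages named in the abstract (random partition, per-subset AP sparsification, fusion with a second AP pass, final assignment) can be sketched in a single process as follows. This is a minimal illustration, not the authors' Hadoop implementation. The function names `affinity_propagation` and `dis_ap`, the negative squared Euclidean similarity, the median preference, the damping value, and the choice to represent each subset by its unweighted exemplars are all assumptions. A plain loop stands in for the parallel map phase.

```python
import random
from statistics import median


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def affinity_propagation(points, damping=0.5, iters=100):
    """Plain AP message passing; returns the indices of the exemplars.

    Assumed settings: similarity = negative squared distance,
    preference = median off-diagonal similarity.
    """
    n = len(points)
    if n <= 1:
        return list(range(n))
    s = [[-sq_dist(p, q) for q in points] for p in points]
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            v = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=v.__getitem__)
            first = v[best]
            second = max(v[k] for k in range(n) if k != best)
            for k in range(n):
                new = s[i][k] - (second if k == best else first)
                r[i][k] = damping * r[i][k] + (1 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            pos[k] = r[k][k]
            total = sum(pos)
            for i in range(n):
                new = total - pos[i]
                if i != k:
                    new = min(0.0, new)
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    # Assumed fallback: if no exemplar emerged, keep every point.
    return exemplars or list(range(n))


def dis_ap(points, n_parts=4, seed=0):
    """Single-process sketch of the DisAP stages; returns (centers, labels)."""
    rng = random.Random(seed)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    # Stage 1: random partition into subsets of similar size.
    parts = [idx[p::n_parts] for p in range(n_parts)]
    # Stage 2 (map phase in the paper): sparsify each subset with AP.
    local = []
    for part in parts:
        subset = [points[i] for i in part]
        local.extend(part[e] for e in affinity_propagation(subset))
    # Stage 3: fuse the local exemplars and cluster them again with AP.
    merged = [points[i] for i in local]
    centers = [local[e] for e in affinity_propagation(merged)]
    # Stage 4: assign every point to its nearest final exemplar.
    labels = [
        min(range(len(centers)), key=lambda c: sq_dist(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels
```

Calling `dis_ap(points, n_parts=4)` returns the indices of the final exemplars in `points` and, for each point, the position of its exemplar in that list. Because each AP call only sees one subset or the fused exemplars, the quadratic similarity matrix stays small, which is the scalability argument the abstract makes.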
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, No. 8, pp. 1762-1772 (11 pages)
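Funding: National Natural Science Foundation of China (60673088); Key Program of the Major Research Plan of the National Natural Science Foundation of China (90920303); Fundamental Research Funds for the Central Universities (KYJD09015); China Postdoctoral Science Foundation (20110491781)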
Keywords: affinity propagation clustering; distributed computing; MapReduce; data partition; cluster ensemble
