期刊文献+

云环境下基于数据流的k-means聚类算法 被引量:12

Algorithm for k-means Based on Data Stream in Cloud Computing
下载PDF
导出
摘要 k-means算法是一种最常用的基于划分的聚类算法。传统的集中式k-means算法已不能适应当前呈爆炸式增长的数据规模,设计分布式k-means算法成为了目前亟需解决的问题。现有分布式k-means算法基于MapReduce计算框架且没有考虑初始聚类中心的影响。由于每个MapReduce任务均需要读写分布式文件系统,导致MapReduce不能有效表达多个任务之间的依赖关系,因此提出了一种基于数据流的计算框架,该框架建立在MapReduce之上,将数据处理过程按照数据流图建模。在该框架的基础上,提出了一种高效的k-means算法,它采用基于多次采样的初始聚类中心选取方法来实现负载均衡及减少迭代次数。实验结果表明,该算法的可扩展性较好,且效率比现有算法高。 k-means algorithm is one of the most commonly used clustering algorithm. Now data scale is exploding, and traditional centralized algorithm can not meet the requirements, so it is an urgent problem to design distributed k-means clustering algorithm currently. Existing distributed k-means algorithms are based on MapReduce framework and don't consider the clustering center. Since each MapReduce job reads and writes data from distributed file system, it is ineffi- cient to express dependencies between jobs. Then this paper proposed a framework based on data stream. Based on Ma- pReduce framework, this framework models according to the data flow diagram. And it proposed an efficient k-means al- gorithm on the framework. It uses an improved algorithm based on sampling to confirm clustering center for load ba- lance and reducing iterations. Experimental results demonstrate that the algorithm can efficiently resolve the large scale k-means cluster.
出处 《计算机科学》 CSCD 北大核心 2015年第11期235-239,265,共6页 Computer Science
基金 国家自然科学基金项目(61373015 61300052) 国家教育部高等学校博士学科点专项科研基金资助项目(20103218110017) 江苏高校优势学科建设工程资助项目(PAPD) 中央高校基本科研业务费专项项目(NP2013307) 云计算-南航-大数据处理引擎技术研究项目资助
关键词 K-MEANS MAPREDUCE 计算框架 数据流 k-means, MapReduce, Framework, Data stream
  • 相关文献

参考文献18

  • 1Han Jia-wei, Kamber M. Data mining concepts and techniques,second edition[M]. Elsevier (Singapore) Pte Ltd,2006:251-263.
  • 2Kriegel H P,Kroger P, Zimek A, Clustering high-dimensionaldata: A survey on subspace clustering.pattern-based clustering,and correlation clustering[J]. ACM Transactions on KnowledgeDiscovery from Data (TKDD) .2009,3(1) : 1.
  • 3Forgy E. Cluster analysis of multivariate data: Efficiency vs. In-terpretability of classifications [J]. Biometrics, 1965,21(3):768.
  • 4Malewicz G,Austern M H,Bik A J C, et al. Pregel: a system forlarge-scale graph processing [C] // Proceedings of the 2010ACM SIGMOD International Conference on Management of Da-ta. ACM,2010:135-146.
  • 5Wang J, Su X. An improved K-Means clustering algorithm[C]//2011 IEEE 3rd International Conference on CommunicationSoftware and Networks (ICCSN). IEEE,2011:44-46.
  • 6Kumar N S, Rao K N,Govardhan A,et al. Undersampled K-means approach for handling imbalanced distributed data[J].Progress in Artificial Intelligence,2014,3(1) : 1-10.
  • 7Dean J, Ghemawat S. MapReduce: simplified data processing onlarge clusters [J]. Communications of the ACM, 2008,51(1):107-113.
  • 8Apache. Hadoop[EB/OL], (2014-4-10)[2014-4-22]. http://ha-doop. apache, org/.
  • 9Ordonez C. Omiecinski E. Efficient disk-based K-means cluste-ring for relational databases [J]. IEEE Transactions on Know-ledge and Data Engineering,2004,16(8) :909-921.
  • 10Pelleg D, Moore A W. X-means: Extending K-means with Effi-cient Estimation of the Number of Clusters [C] // ICML. 2000:727-734.

二级参考文献31

  • 1潘锐,朱大铭,马绍汉,肖进杰.k-Median近似计算复杂度与局部搜索近似算法分析[J].软件学报,2005,16(3):392-399. 被引量:8
  • 2[11]J Peng,Y Xia.A new theoretical framework for k-means-type clustering.McMaster University,Advanced Optimization Laboratory,Tech Rep:ADVOL2004/06,2004
  • 3[12]J Peng,Y Xia.A cutting algorithm for the minimum-sum-of-squared error clustering.SIAM Int'lConf on Data Mining,Newport Beach,CA,2005
  • 4[1]M Inaba,N Kaoth,H Imai.Application of weighted Voronoi diagrams and randomization to variance-based k-clustering(extended abstract).In:Proc of the 10th Annual Symp on Computational Geometry.New York:ACM Press,1994.332-339
  • 5[2]V Arya,et al.Local search heurictics for k-median and facility location problems.STOC'2001,Hersonissons,Crete,Greece,2001
  • 6[4]T Kanungo,D M Mount,N Netanyahu,et al.A local search approximation algorithm for k-means clustering.Computational Geometry,2004,28:89-112
  • 7[5]J Matousek.On approximate geometric k-clustering.Discrete and Computational Geometry,2004,24:61-84
  • 8[6]M Song,S Rajasekaran.Fast k-means algorithms with constant approximation.The 16th Annual Int'lSymp on Algorithms and Computation,Sanya,Hainan,2005
  • 9[7]A Kumar,Y Sabharwal,S Sen.A sample linear time (1+ε) algorithm for k-means clustering in any dimensions.In:Proc of the 45th FOCS.Piscataway,NJ:IEEE Press,2004.454-462
  • 10[8]R Ostrovsky,Y Rabani,L J Schulman.The effectiveness of lloyd-type metohds for the k-means problem.The 47th Annual IEEE Symp on the Foundations of Computer Science FOCS'06,Berkeley,CA,2006

共引文献124

同被引文献90

引证文献12

二级引证文献58

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部