Abstract
To address the poor clustering quality of existing algorithms on big data and their low processing performance caused by iteration and other factors, this paper proposes PAClustering, an efficient parallel clustering algorithm for the Hadoop platform. A weight-based method first partitions the whole dataset into a number of data blocks according to its distribution. Within each data block, compact data are then abstracted into a single vector to form a micro-cluster. Finally, clustering is performed through tree-shaped (arborescence) merging, which improves clustering accuracy while avoiding the iterative computation that traditional algorithms require during clustering. Experiments on datasets of different sizes show that PAClustering not only achieves relatively high clustering accuracy and stability but also delivers good processing performance.
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journals (北大核心)
2016, No. 8, pp. 1770-1774 (5 pages)
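Funding
Supported by the National Natural Science Foundation of China, Young Scientists Fund (61402053)
Supported by the Science and Technology Plan Project of Hunan Province (2014SK3080)
Supported by the Outstanding Youth Project of the Hunan Provincial Department of Education (14B005)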
Keywords
big data
Hadoop
parallel clustering
micro-cluster
arborescence merge