摘要
谱聚类算法是基于谱图分割理论的聚类方法,其对高维、非凸数据分布问题有很好的聚类效果。但对大规模数据问题的聚类,该方法存在着计算时间和存储空间等方面的瓶颈。本文给出了一个自适应的谱聚类并行算法,通过局部计算和异步循环通信并行方法,最大限度减少了并行谱聚类中数据通信次数,并通过计算与通信重叠策略,进一步降低了并行算法的通信开销。在并行算法实现中,将自主开发的最优预条件共轭梯度法并行求解器PLOBPCG用于谱聚类的特征降维。在中科院的"元"超级计算机上,通过对两类大规模数据聚类的测试表明,在2048核上的加速比接近线性加速,并行效率达到96%以上。
Spectral clustering algorithm is a clustering method based on spectral segmentation theory,which has a good clustering effect on high-dimensional and non-convex data distribution problems.However,there are some bottlenecks in the clustering of large scale data,such as computation time and storage space.In this paper,a parallel self-tuning spectral clustering algorithm is proposed.By means of local calculation and asynchronous loop communication,the number of data communication in parallel spectral clustering is minimized.In order to further reduce the communication overhead,a method of computing and communication overlap is used.In the parallel algorithm implementation,the self-developed locally optimal block preconditioned conjugate gradient method parallel solver PLOBPCG is applied to the feature reduction of spectral clustering.On the supercomputer"Era"of Chinese Academy of Sciences,the clustering test of two types of large-scale data shows that the speedup ratio of the 2048 core is nearly linear,and the parallel efficiency is above 96%.
作者
苏琳
赵永华
李瑞琳
Su Lin Zhao Yonghua Li Ruilin(University of Chinese Academy of Sciences, Beijing 100049, China Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
出处
《科研信息化技术与应用》
2016年第6期44-53,共10页
E-science Technology & Application
基金
数学工程与先进计算国家重点实验室开放基金(2014A03)
关键词
谱聚类
并行算法
自适应
PLOBPCG
循环通信
spectral clustering
parallel algorithm
self-tuning
PLOBPCG
loop communication