Abstract
The traditional bisecting K-means algorithm selects cluster centroids at random during each bisection step. To obtain good centroids, the random selection must be repeated many times, which is time-consuming. This paper therefore proposes using the two points separated by the maximum distance as the initial centroids, which effectively reduces the time cost, and applies point sampling to limit the influence of outliers. In addition, because data volumes keep growing, the paper proposes a parallel bisecting K-means algorithm based on the Hadoop distributed platform. Experimental results show that the parallel algorithm achieves good speedup and efficiency.
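The seeding idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names, the `sample_size` parameter, the use of a uniform random sample with an exact farthest-pair search inside it, and the rule of always splitting the cluster with the largest sum of squared errors are all assumptions made for illustration. The Hadoop parallelization is not shown.

```python
import random
from math import dist  # Python 3.8+


def farthest_pair(points, sample_size, rng):
    """Return the two most distant points within a random sample (assumed scheme)."""
    sample = points if len(points) <= sample_size else rng.sample(points, sample_size)
    best, best_d = (sample[0], sample[1]), -1.0
    for i in range(len(sample)):
        for j in range(i + 1, len(sample)):
            d = dist(sample[i], sample[j])
            if d > best_d:
                best_d, best = d, (sample[i], sample[j])
    return best


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(dist(p, c) ** 2 for p in points)


def bisect(points, sample_size, rng, max_iter=20):
    """Split one cluster in two with 2-means seeded by a sampled farthest pair."""
    c0, c1 = farthest_pair(points, sample_size, rng)
    for _ in range(max_iter):
        left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        n0, n1 = mean(left), mean(right)
        if (n0, n1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = n0, n1
    return left, right


def bisecting_kmeans(points, k, sample_size=50, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)
        left, right = bisect(target, sample_size, rng)
        if not left or not right:
            break
        clusters = [c for c in clusters if c is not target] + [left, right]
    return clusters


pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters = bisecting_kmeans(pts, k=2)
```

Because the seeds are chosen deterministically from the sample rather than drawn at random, each bisection runs 2-means once instead of over several random restarts. Restricting the farthest-pair search to a sample makes it less likely that a single extreme outlier is chosen as a seed, at the cost of an approximate rather than exact farthest pair.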
Source
《科技广场》
2016, No. 9, pp. 4-8 (5 pages)
Science Mosaic
Funding
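National Natural Science Foundation of China project "Research and Application of Gesture Interaction Technology Based on Depth Information and Saliency Computation" (No. 61363046)
Principal investigator: Yang Wenji (杨文姬)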