Abstract
Distributed computing provides a new platform for processing big data and can effectively increase the execution speed of algorithms. Building on the DBSCAN algorithm, this paper proposes a data grid-partitioning algorithm that divides the dataset on each partition into cell blocks whose side length equals the Eps radius. The search for a data object's Eps-neighborhood is thereby restricted to the eight cells adjacent to the object's cell, which speeds up both the Eps-neighborhood search and the clustering itself; the algorithm shows good speedup and scaleup. The method for merging the clusters produced on the individual partitions is also optimized.
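The grid idea can be illustrated with a minimal single-partition sketch. It assumes two-dimensional points and Euclidean distance, and it scans the object's own cell together with its eight adjacent cells; the function names `build_grid` and `eps_neighborhood` are hypothetical and do not come from the paper. The distributed partitioning and the cluster-merging step are not shown.

```python
from collections import defaultdict
from math import floor, hypot


def build_grid(points, eps):
    """Map each cell index (cx, cy) to the indices of the points inside it.

    Cells are squares with side length eps (assumption: 2-D data).
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(i)
    return grid


def eps_neighborhood(points, grid, eps, i):
    """Return indices of all points within eps of points[i] (itself included).

    Only the 3x3 block of cells around the point is scanned, because a point
    within distance eps cannot lie more than one cell away in either axis.
    """
    x, y = points[i]
    cx, cy = floor(x / eps), floor(y / eps)
    neighbors = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get avoids creating empty cells in the defaultdict
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if hypot(px - x, py - y) <= eps:
                    neighbors.append(j)
    return neighbors


# Hypothetical sample data, for illustration only
points = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.0), (2.5, 2.5), (1.1, 0.0)]
grid = build_grid(points, eps=1.0)
print(sorted(eps_neighborhood(points, grid, 1.0, 0)))
```

Because each lookup touches at most nine cells instead of the whole partition, the cost of a neighborhood query depends on local point density rather than on the partition size, which is the source of the speedup the abstract describes.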