Abstract
To address the high overall time complexity of the FSDP clustering algorithm, which must traverse the entire dataset to compute the local density and minimum distance of each data object, this paper proposes SFSDP, a Spark-based parallel FSDP clustering algorithm. First, spatial grid partitioning divides the dataset to be clustered into multiple partitions of relatively balanced size. Then, an improved FSDP clustering algorithm clusters the data within each partition in parallel. Finally, the local cluster sets of the partitions are merged to produce the global cluster set. Experimental results show that, compared with FSDP, SFSDP can effectively cluster large-scale datasets and performs well in terms of accuracy and scalability.
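As a rough illustration of the cost the abstract refers to, the sketch below computes the two density-peaks quantities (local density and minimum distance to a denser point) by brute force, and groups points into grid cells. It is not the authors' SFSDP implementation: the function names, the cutoff parameter `dc`, the cutoff-count density definition, and the uniform cell size `cell` are assumptions made for this example, and the abstract does not say how the paper balances partition sizes or merges local clusters.

```python
import math
from collections import defaultdict


def density_and_delta(points, dc):
    """Brute-force density-peaks quantities; needs O(n^2) distance evaluations."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer than the cutoff distance dc (assumed definition)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc) for i in range(n)]
    delta = []
    for i in range(n):
        # Distances from point i to every point with strictly higher density
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour takes its largest distance by convention
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def grid_partition(points, cell):
    """Group points by the uniform grid cell that contains them (hypothetical helper)."""
    parts = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / cell) for c in p)
        parts[key].append(p)
    return dict(parts)
```

Running `density_and_delta` on each cell returned by `grid_partition` restricts the quadratic scan to one partition at a time, which is the general idea behind partition-level parallelism; points near cell borders would still need the cross-partition merge step the abstract mentions.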
Authors
Sun Weipeng (孙伟鹏)
Wu Xisheng (吴锡生)
Meng Bin (孟斌)
Affiliations: School of IoT Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China; Software Engineering Center, China Ship Scientific Research Center, Wuxi, Jiangsu 214082, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core (北大核心)
2020, No. 1, pp. 163-166, 171 (5 pages)
Funding
National Natural Science Foundation of China (61672265)
Youth Innovation Fund of the 702 Research Institute (J775)