Abstract
Fuzzy K-means is an important soft clustering algorithm that quantitatively determines how closely objects are related. To address its shortcomings in analyzing and processing large-scale data, this paper proposes a parallel implementation based on the MapReduce programming model. First, to speed up the MapReduce job, the design of the Combine function is improved: intermediate results are processed locally before the output of the Map function is passed to the Reduce functions on other nodes, which reduces communication overhead. Then, several data sets of different sizes are tested on the Hadoop distributed computing platform. The experiments show that the MapReduce-based parallel fuzzy K-means algorithm is suitable for analyzing and processing large-scale data, that execution speed increases by about 1.9 times, and that the clustering effect is more pronounced.
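The abstract does not give the Combine function itself, so the following is only a minimal sketch of the general idea, assuming the standard fuzzy K-means update rules with fuzzifier m. It simulates one iteration in memory rather than calling any Hadoop API, and every name in it (`memberships`, `map_point`, `combine`, `reduce_center`, `one_iteration`) is hypothetical. Each map call emits, per cluster, the weighted point and its weight; the combiner sums these locally so that each input split ships at most one record per cluster to the reducers.

```python
from collections import defaultdict


def memberships(point, centers, m=2.0):
    """Standard fuzzy K-means membership of one point in every cluster."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    zeros = [j for j, d in enumerate(d2) if d == 0.0]
    if zeros:
        # The point sits exactly on a center: give that cluster full membership.
        return [1.0 if j == zeros[0] else 0.0 for j in range(len(centers))]
    exponent = 1.0 / (m - 1.0)  # applied to squared distances
    inv = [(1.0 / d) ** exponent for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def map_point(point, centers, m=2.0):
    """Map: emit (cluster index, (u^m * point, u^m)) for every cluster."""
    for j, u in enumerate(memberships(point, centers, m)):
        w = u ** m
        yield j, ([w * p for p in point], w)


def combine(pairs):
    """Combine: sum the partial results per cluster on the local node."""
    acc = {}
    for j, (vec, w) in pairs:
        if j not in acc:
            acc[j] = (list(vec), w)
        else:
            s, sw = acc[j]
            acc[j] = ([a + b for a, b in zip(s, vec)], sw + w)
    return list(acc.items())


def reduce_center(partials):
    """Reduce: merge the partial sums from all nodes into a new center.

    Assumes the total weight is non-zero (not guarded in this sketch).
    """
    vecs, weights = zip(*partials)
    total_w = sum(weights)
    return [sum(col) / total_w for col in zip(*vecs)]


def one_iteration(splits, centers, m=2.0):
    """Simulate one MapReduce pass; each split stands in for one map node."""
    shuffled = defaultdict(list)
    for split in splits:
        local = (kv for point in split for kv in map_point(point, centers, m))
        for j, partial in combine(local):  # at most len(centers) records per split
            shuffled[j].append(partial)
    return [reduce_center(shuffled[j]) for j in range(len(centers))]
```

Because the center update is a ratio of two sums, summing locally first should not change the result; it only reduces how many records cross the network. For example, `one_iteration([[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]], [(0.0, 0.5), (10.0, 10.5)])` returns two updated centers, one per cluster.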
Authors
YANG Yanqing (杨延庆); YUAN Huabing (袁华兵)
Division of Information Technology, Xi'an Medical University, Xi'an 710021
Source
Computer & Digital Engineering (《计算机与数字工程》)
2020, No. 7, pp. 1564-1567, 1765 (5 pages)
Funding
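Supported by the Shaanxi Provincial Youth Science Foundation Project (No. 71701160)
and the Xi'an Medical University Teaching Reform Research Project (No. 2018JG-07).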