K-means Algorithm for Parallel Optimal Sampling Clustering of Multi-source Information Data (多源信息数据的并行优化抽样聚类K-means算法) Cited by: 10
Abstract When applied to multi-source information data, the K-means algorithm fails to cluster the data set accurately and processes it inefficiently. To address this, and drawing on the characteristics of multi-source information data and of traditional algorithms, this article proposes a parallel optimized sampling clustering K-means algorithm for multi-source information data. The algorithm uses the feature function and the ideal partition function from fuzzy classification to preprocess the multi-source information data. Based on the MapReduce model and the definition of the Canopy algorithm, data of the same kind are identified and assigned to the same subset, and the BK-means algorithm is then used to cluster each Canopy subset. A sampling strategy divides the data space into panes of equal width; data points are removed according to a comparison between the number of points within a pane and a minimum point count, together with a determined distance value. New cluster centers and patterns are then selected by the maximum-minimum distance method, which completes the parallel optimized sampling clustering K-means algorithm. Simulation results indicate that the proposed algorithm has good parallelism, high clustering accuracy, and excellent robustness and convergence, and that it significantly shortens processing time.
Author YANG Xiao-mei (杨晓梅) (College of Information Management, Xinjiang University of Finance and Economics, Urumqi, Xinjiang 830012, China)
Source Computer Simulation (《计算机仿真》), Peking University Core Journal, 2020, No. 7, pp. 305-308, 332 (5 pages)
Funding 2017 Ministry of Education Humanities and Social Sciences Research Planning Fund Project (17XJJAZH001).
Keywords Multi-source information data; Convergence; Cluster center; Euclidean metric
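Two of the building blocks named in the abstract, Canopy pre-grouping and maximum-minimum distance center selection, can be illustrated with a minimal sketch. This is not the article's implementation: it is the common textbook form of each technique, run on a single machine without MapReduce, and it omits the fuzzy preprocessing, the pane-based sampling, and BK-means. The function names `canopy` and `max_min_centers`, the thresholds `t1` and `t2`, and the sample points are all assumptions made for illustration.

```python
import math


def canopy(points, t1, t2):
    """Textbook Canopy grouping (assumed form, not the paper's version).

    t1 is the loose threshold (canopy membership), t2 the tight threshold
    (points closer than t2 to a center stop being center candidates).
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unassigned point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)       # Euclidean metric
            if d < t1:
                members.append(p)          # inside the loose radius: join canopy
            if d >= t2:
                still_remaining.append(p)  # outside the tight radius: stays a candidate
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def max_min_centers(points, k):
    """Pick k initial centers by the maximum-minimum distance rule.

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [points[0]]
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        # Refresh each point's distance to its nearest chosen center.
        nearest = [min(nearest[i], math.dist(points[i], points[idx]))
                   for i in range(len(points))]
    return centers


# Hypothetical sample data: three well-separated groups in the plane.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
groups = canopy(sample, t1=3.0, t2=1.5)
seeds = max_min_centers(sample, k=3)
```

In this sketch the Canopy pass only produces coarse groups cheaply, and the max-min rule spreads the initial centers apart; a K-means variant would then refine clusters inside each group.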
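  • Related literature: 12 references; 99 second-level references; 81 co-citing documents; 151 co-cited documents; 10 citing documents; 7 second-level citing documents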
