Abstract
Fuzzy C-Means (FCM) is a clustering algorithm widely used in big data analysis. Because both the clustering results and the clustering speed of FCM depend heavily on the initial cluster centers, this paper proposes an improved algorithm, Canopy-FCMBM. First, the Canopy algorithm generates the cluster centers and the number of clusters, and these are used as the initial cluster centers for FCM. This addresses two problems: the difficulty of determining the number of clusters, and the tendency of randomly chosen initial centers to converge to a local optimum. Second, because the data are multi-dimensional and unevenly distributed, the distance measure in the FCM objective function is changed from Euclidean distance to Mahalanobis distance. Finally, Canopy-FCMBM is parallelized on the Spark programming model to improve execution efficiency. Experimental results show that, compared with the traditional FCM algorithm, the Spark-based Canopy-FCMBM algorithm increases clustering accuracy by 12.7% and clustering speed by 1.35 times, giving a better overall clustering result.
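The pipeline described in the abstract can be illustrated with a minimal single-machine sketch in NumPy. It is not the authors' Canopy-FCMBM implementation and does not use Spark. Several details are assumptions made only for illustration: the function names `canopy_centers` and `fcm_mahalanobis`, the use of a single tight Canopy threshold `t2`, taking each canopy's mean as its center, and computing the Mahalanobis distance from one global covariance matrix of the data.

```python
import numpy as np


def canopy_centers(X, t2):
    """Simplified Canopy pass (assumption: only the tight threshold t2 is used).

    Repeatedly take the first remaining point as a seed, group every remaining
    point within Euclidean distance t2 of it, and use the group mean as a center.
    The number of centers returned is the number of clusters.
    """
    remaining = np.arange(len(X))
    centers = []
    while remaining.size:
        seed = X[remaining[0]]
        near = np.linalg.norm(X[remaining] - seed, axis=1) <= t2
        centers.append(X[remaining[near]].mean(axis=0))
        remaining = remaining[~near]  # the seed itself is always removed
    return np.array(centers)


def fcm_mahalanobis(X, centers, m=2.0, n_iter=100, tol=1e-6):
    """FCM iterations with squared Mahalanobis distance.

    Assumption: one global covariance matrix is used for all clusters.
    Returns the final centers and the membership matrix from the last iteration.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    vi = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    for _ in range(n_iter):
        diff = X[:, None, :] - centers[None, :, :]         # shape (n, c, d)
        d2 = np.einsum("ncd,de,nce->nc", diff, vi, diff)   # squared distances
        d2 = np.maximum(d2, 1e-12)                         # avoid division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)           # memberships, rows sum to 1
        um = u ** m
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]  # weighted means
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, u


# Small demonstration on two synthetic, well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
init = canopy_centers(X, t2=2.0)       # Canopy supplies both k and the start centers
centers, u = fcm_mahalanobis(X, init)  # FCM refines them
print(len(init), centers)
```

The point of the sketch is the hand-off: the Canopy pass fixes the number of clusters and the starting centers, so FCM no longer needs a user-supplied cluster count or a random initialization. A Spark version would distribute the per-point distance and membership computations, which this sketch does not attempt.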
Authors
XIA Xing (夏邢)
XUE Tao (薛涛)
LI Ting (李婷)
School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, China
Source
Journal of Xi'an Polytechnic University (《西安工程大学学报》)
CAS
2019, No. 1, pp. 100-105 (6 pages)
Funding
Natural Science Basic Research Plan of Shaanxi Province, General Program (2018JQ6103)