Abstract
Clustering is the process of partitioning a collection of objects into distinct classes, each composed of similar objects, and it is one of the most important techniques in data mining. Clustering big data, however, is a complex problem: because of the sheer volume of data, clustering algorithms consume an enormous amount of time. Parallel computing is a very effective way to address insufficient computing power. Accordingly, this paper uses MapReduce on the Hadoop platform to run the computation over large-scale data sets in parallel, keeping the time complexity of big-data clustering within an acceptable range. Finally, the performance gain of the method is evaluated in terms of time consumption and clustering accuracy; the method greatly improves running speed while maintaining relatively high accuracy.
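The MapReduce approach summarized in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so this example assumes k-means purely for illustration, and it simulates the map and reduce phases in plain Python rather than on Hadoop. All function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(split: List[Point], centroids: List[Point]) -> Iterator[Tuple[int, Tuple[Point, int]]]:
    """Mapper: for each point in one input split, emit (cluster_id, (point, 1))."""
    for point in split:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs: Iterable[Tuple[int, Tuple[Point, int]]]) -> Dict[int, Point]:
    """Reducer: sum coordinates and counts per cluster_id, then take the mean."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for cid, (point, n) in pairs:
        if cid not in sums:
            sums[cid] = [0.0] * len(point)
        for d, value in enumerate(point):
            sums[cid][d] += value
        counts[cid] += n
    return {cid: tuple(s / counts[cid] for s in total) for cid, total in sums.items()}


def kmeans_iteration(splits: List[List[Point]], centroids: List[Point]) -> List[Point]:
    """One k-means step: each split is mapped independently (this is the part
    that would run in parallel on a cluster), then results are reduced."""
    pairs = [kv for split in splits for kv in map_phase(split, centroids)]
    updated = reduce_phase(pairs)
    # A cluster that received no points keeps its previous centroid.
    return [updated.get(i, c) for i, c in enumerate(centroids)]


if __name__ == "__main__":
    splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_iteration(splits, [(0.0, 0.0), (10.0, 10.0)]))
```

Because each split is processed by `map_phase` without reference to any other split, the expensive distance computations can be distributed across nodes; only the per-cluster sums need to be combined in the reduce step.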
Authors
GUO Chenchen (郭晨晨)
ZHU Hongkang (朱红康)
School of Mathematics and Computer Science, Shanxi Normal University, Linfen 041000, China
Source
Journal of Ludong University: Natural Science Edition (《鲁东大学学报(自然科学版)》)
2017, No. 1, pp. 31-35 (5 pages)
Funding
Natural Science Foundation of Shanxi Province (2015011040)