Abstract
Based on a study of the Hadoop platform and the MapReduce programming framework, this paper presents a parallel canopy algorithm for text clustering built on MapReduce. The canopy algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the sensitivity to noise that affects traditional clustering algorithms. It also uses a suitable fast approximate distance metric, which greatly speeds up clustering. Experiments show that the algorithm achieves good cluster speedup under the MapReduce framework and is suitable for processing large-scale data sets.
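The two-threshold idea can be illustrated with a minimal sketch. It assumes the standard canopy formulation with T1 > T2. The Manhattan distance is only a stand-in for the "fast approximate distance metric", which the abstract does not name. The map/reduce split in `parallel_canopy_centers` is a common scheme assumed here, not taken from the paper, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def manhattan(a: Point, b: Point) -> float:
    """Cheap stand-in distance (assumption: the paper's metric is not given)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy(
    points: List[Point],
    t1: float,
    t2: float,
    dist: Callable[[Point, Point], float] = manhattan,
) -> List[Tuple[Point, List[Point]]]:
    """Return (center, members) pairs; canopies may overlap."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unassigned point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                # Loose threshold: p joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: p may still seed or join others.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def parallel_canopy_centers(
    splits: List[List[Point]], t1: float, t2: float
) -> List[Point]:
    """Simulate an assumed MapReduce layout in a single process."""
    # "Map": each split computes its local canopy centers independently.
    local_centers = [c for split in splits for c, _ in canopy(split, t1, t2)]
    # "Reduce": cluster the local centers again to get the global centers.
    return [c for c, _ in canopy(local_centers, t1, t2)]


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
    for center, members in canopy(pts, t1=4, t2=2):
        print(center, members)
```

In this example the point (3, 0) lies within T1 of the first center but outside T2, so it is a member of the first canopy and also becomes the center of its own, which is the overlap the abstract describes.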
Source
Journal of Inner Mongolia University of Science and Technology (《内蒙古科技大学学报》)
CAS
2013, No. 3, pp. 273-277 (5 pages)
Funding
Natural Science Foundation of Inner Mongolia (2012MS0912)
Scientific Research Project of the Inner Mongolia Department of Education (Njzy12110)