
A Parallel Canopy Algorithm for Text Clustering Based on MapReduce (基于MapReduce的并行遮盖文本聚类算法)
Abstract: Building on a study of the Hadoop platform and the MapReduce programming framework, this paper presents a parallel canopy algorithm for text clustering based on MapReduce. The canopy algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the sensitivity to noise found in traditional clustering algorithms. It also uses an appropriate fast approximate distance metric, which greatly speeds up clustering. Experiments show that the algorithm achieves good cluster speedup under the MapReduce framework and is suitable for processing large-scale data sets.
Source: Journal of Inner Mongolia University of Science and Technology (《内蒙古科技大学学报》), CAS, 2013, No. 3, pp. 273-277 (5 pages)
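Funding: Natural Science Foundation of Inner Mongolia (2012MS0912); Scientific Research Project of the Inner Mongolia Department of Education (Njzy12110)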
Keywords: text clustering; canopy algorithm; Hadoop; MapReduce
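
The abstract describes the canopy step only at a high level, so the following is a minimal sketch of it rather than the authors' implementation. It assumes cosine distance over term-count vectors as the cheap approximate metric, takes the first remaining point as each canopy center, and simulates the MapReduce phases with plain functions (mappers find local centers per partition, a single reducer merges them). The names `cosine_distance`, `canopy`, `map_phase`, and `reduce_phase`, the sample documents, and the threshold values are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cheap distance between two sparse term-count vectors (dict-like)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def canopy(points, t1, t2, distance=cosine_distance):
    """Return (center_index, member_indices) pairs; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # assumption: first remaining point is the center
        members, still_remaining = [], []
        for i in remaining:
            d = distance(points[center], points[i])
            if i == center or d < t1:
                members.append(i)  # loosely inside this canopy
            if i != center and d >= t2:
                still_remaining.append(i)  # may still seed or join other canopies
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def map_phase(partition, t1, t2):
    """Simulated mapper: emit the canopy centers found in one data partition."""
    return [partition[c] for c, _ in canopy(partition, t1, t2)]


def reduce_phase(local_centers, t1, t2):
    """Simulated reducer: run canopy again over all mapper-emitted centers."""
    return [local_centers[c] for c, _ in canopy(local_centers, t1, t2)]


# Hypothetical toy corpus: two documents about Hadoop, two about clustering.
docs = [
    "hadoop mapreduce cluster",
    "mapreduce hadoop cluster job",
    "text clustering canopy",
    "canopy clustering text threshold",
]
vectors = [Counter(d.split()) for d in docs]

result = canopy(vectors, t1=0.8, t2=0.4)
print(result)  # [(0, [0, 1]), (2, [2, 3])]

# Simulate two mappers (one per partition) followed by one reducer.
partitions = [vectors[:2], vectors[2:]]
local_centers = [c for p in partitions for c in map_phase(p, 0.8, 0.4)]
global_centers = reduce_phase(local_centers, 0.8, 0.4)
print(len(global_centers))  # 2
```

A point that lies within T1 of a center but not within T2 is added to that canopy and also stays in the candidate list, which is how overlapping subsets arise. Points within T2 are removed, so they cannot seed or join another canopy.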

References (5)

  • 1. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.
  • 2. White T. Hadoop: The Definitive Guide [M]. Sebastopol: O'Reilly Media, Inc., 2012.
  • 3. McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching [A]. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [C]. USA: ACM, 2000: 169-178.
  • 4. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.
  • 5. Fudan University Chinese Corpus (复旦大学中文语料库) [EB/OL]. http://www.nlp.org.cn, 2008-06-21.
