The parallel canopy algorithm for text clustering based on MapReduce (基于MapReduce的并行遮盖文本聚类算法)


Abstract: Building on a study of the Hadoop platform and the MapReduce programming framework, this paper proposes a parallel canopy algorithm for text clustering based on MapReduce. The canopy algorithm uses two distance thresholds, T1 and T2, to build overlapping subsets, which avoids the noise sensitivity of traditional clustering algorithms. It also adopts a suitable fast approximate distance metric, which greatly speeds up clustering. Experiments show that the algorithm achieves good cluster speedup under the MapReduce framework and is suitable for processing large-scale data sets.
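The two-threshold idea in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the approximate distance metric, so Manhattan distance stands in for it; canopy members are drawn from the points still on the candidate list; and the map and reduce phases are simulated with plain functions rather than Hadoop jobs. The names `manhattan`, `canopy`, and `parallel_canopy_centers` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Canopy = Tuple[Point, List[Point]]


def manhattan(a: Point, b: Point) -> float:
    """Cheap distance; an assumed stand-in for the paper's approximate metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy(points: Sequence[Point], t1: float, t2: float,
           dist: Callable[[Point, Point], float] = manhattan) -> List[Canopy]:
    """Build (center, members) canopies with loose threshold t1 and tight threshold t2."""
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies: List[Canopy] = []
    while remaining:
        center = remaining[0]
        # Points within T1 of the center join this canopy (canopies may overlap).
        members = [p for p in remaining if dist(center, p) < t1]
        # Points within T2 are removed and can no longer seed or join a canopy.
        remaining = [p for p in remaining if dist(center, p) >= t2]
        canopies.append((center, members))
    return canopies


def parallel_canopy_centers(splits: Sequence[Sequence[Point]],
                            t1: float, t2: float) -> List[Point]:
    """Simulate one assumed MapReduce layout: local centers per split, then a merge."""
    local_centers: List[Point] = []
    for split in splits:  # "map" step: each input split is processed independently
        local_centers.extend(center for center, _ in canopy(split, t1, t2))
    # "reduce" step: run the same procedure once more over the collected centers
    return [center for center, _ in canopy(local_centers, t1, t2)]
```

For example, with the one-dimensional points (0), (1), (3), (10), (11), T1 = 4 and T2 = 2, the point (3) falls inside the canopy centered at (0) and also seeds a canopy of its own, which is the overlap the abstract refers to.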

Authors: Zhang Yanan (张亚楠), Tan Yuesheng (谭跃生)

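Affiliations: School of Information Engineering, Inner Mongolia University of Science and Technology; Engineering Training Center, Inner Mongolia University of Science and Technology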

Source: Journal of Inner Mongolia University of Science and Technology (《内蒙古科技大学学报》), CAS, 2013, No. 3, pp. 273-277 (5 pages)

Funding: Natural Science Foundation of Inner Mongolia (2012MS0912); Research Project of the Inner Mongolia Department of Education (Njzy12110)

Keywords: document clustering; canopy algorithm; Hadoop; MapReduce

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]



