
Parallel clustering of very large document datasets with MapReduce (cited by: 9)
Abstract: Fast and effective clustering of large-scale document collections is an active topic in data mining research and applications. To preserve clustering quality while improving efficiency, this paper proposes a document clustering algorithm based on searching for a "mutually least-similar document pair", together with a distributed parallel computing model. First, a document similarity measure is defined using the vector space model (VSM). Second, the two initial centers for each bisection are chosen by searching for a mutually least-similar document pair, which yields a bisecting K-means algorithm that optimizes the cluster centroids with a single partitioning pass. Finally, a parallel clustering model for large-scale document collections, aimed at cloud computing applications, is designed on the MapReduce framework. Experiments on a Hadoop platform with real document datasets show that the proposed algorithm achieves clustering quality comparable to the original bisecting K-means while being clearly more efficient, and that the parallel clustering model scales well across different data sizes and numbers of compute nodes.
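Source: Journal of University of Science and Technology Beijing (《北京科技大学学报》), 2014, No. 10, pp. 1411-1419 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (71271027); Specialized Research Fund for the Doctoral Program of Higher Education (20120006110037); Fundamental Research Funds for the Central Universities (FRF-TP-10--006B).
Keywords: cloud computing; documents; clustering; similarity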
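The bisecting step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula or the search procedure, so this sketch assumes plain term-frequency vectors with cosine similarity, and a simple chained search that stops once two documents are each other's least-similar partner. All function names (`tf_vector`, `cosine`, `least_similar`, `mutual_least_similar_pair`, `bisect`) are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed stand-in for the paper's VSM weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def least_similar(i, vecs):
    """Return (index, similarity) of the document least similar to document i."""
    candidates = ((j, cosine(vecs[i], vecs[j])) for j in range(len(vecs)) if j != i)
    return min(candidates, key=lambda pair: pair[1])


def mutual_least_similar_pair(vecs, start=0):
    """Follow least-similar links until neither document has a strictly less similar partner."""
    a = start
    b, s = least_similar(a, vecs)
    while True:
        c, s2 = least_similar(b, vecs)
        if s2 >= s:  # nothing is strictly less similar to b than a is
            return a, b
        a, b, s = b, c, s2  # similarity strictly decreases, so the loop terminates


def bisect(vecs, start=0):
    """Split documents into two clusters around the pair found above (one assignment pass)."""
    a, b = mutual_least_similar_pair(vecs, start)
    left, right = [], []
    for i, v in enumerate(vecs):
        (left if cosine(v, vecs[a]) >= cosine(v, vecs[b]) else right).append(i)
    return left, right


docs = ["apple banana apple", "banana apple fruit", "engine wheel car", "car engine road"]
print(bisect([tf_vector(d) for d in docs]))  # ([0, 1], [2, 3])
```

The chained search needs only a few linear scans rather than all pairwise comparisons, which is one plausible reading of why seeding from a mutually least-similar pair could be cheaper than repeated random restarts. In a MapReduce setting, the per-document similarity computations in `least_similar` and `bisect` would be natural candidates for the map phase, though the paper's actual job design is not described in the abstract.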