
Research and Implementation of Textual Clustering in Distributed Environment (分布式环境下的文本聚类研究与实现)

Cited by: 3
Abstract [Objective] To build, from open-source tools, an application platform for text clustering and classification in a distributed environment. [Methods] Building on the convergence of words across massive text collections, word clustering is used to guide text clustering and classification. The process has three steps: preprocess the training texts with an open-source tokenizer and related tools; run cluster analysis on the resulting word set with the Mahout data mining platform; and classify each test text by computing its similarity to the word clusters with a similarity algorithm. [Results] Text clustering and classification based on word clustering in a distributed environment effectively relieves the word-clustering bottleneck for massive text collections. In testing, word clustering results were satisfactory once the training text set was increased to 100 with an iterative convergence threshold of 0.01. [Limitations] The test data are limited in scale and restricted to news; word clustering in other domains needs further testing, optimization, and adjustment. [Conclusions] The detailed description of the development environment architecture and key steps of the word-clustering-based text clustering and classification algorithm helps researchers gain an in-depth understanding of the relevant open-source tools and of deploying a distributed parallel environment.
Author: 赵华茗 (Zhao Huaming)
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2015, No. 1, pp. 82-88 (7 pages)
Keywords: Distributed environment; Clustering; Textual clustering; Hadoop; Mahout
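
The pipeline outlined in the abstract (cluster the words of a training set, then assign a test text to the most similar word cluster) can be illustrated with a small single-machine sketch. This is not the paper's implementation and does not use Mahout or Hadoop. It assumes k-means as the clustering algorithm and cosine similarity as the similarity measure, neither of which the abstract specifies. The toy corpus, whitespace tokenization, and all function names are hypothetical. Only the convergence threshold of 0.01 comes from the abstract.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def word_vectors(tokenized_docs):
    """Assumption: each word is represented by its counts across the training documents."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    counts = [Counter(doc) for doc in tokenized_docs]
    return vocab, [[c[w] for c in counts] for w in vocab]

def cluster_words(vectors, k, threshold=0.01, max_iter=50):
    """K-means with cosine similarity; stops once no centroid moves more than `threshold`."""
    # Deterministic farthest-first seeding keeps the toy example reproducible.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(max_iter):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid in place
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift < threshold:
            break
    # Final assignment against the converged centroids.
    return [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]

def classify(tokens, vocab, labels, k):
    """Score a test text against each word cluster and return (best cluster, all scores)."""
    tf = Counter(tokens)
    text_vec = [tf[w] for w in vocab]
    scores = [cosine(text_vec, [1 if lab == j else 0 for lab in labels]) for j in range(k)]
    return max(range(k), key=scores.__getitem__), scores

# Hypothetical toy corpus; whitespace splitting stands in for a real tokenizer.
train = [doc.split() for doc in [
    "stock market shares",
    "market shares trading",
    "football match goal",
    "match goal team",
]]
vocab, vectors = word_vectors(train)
labels = cluster_words(vectors, k=2, threshold=0.01)
clusters = [sorted(w for w, lab in zip(vocab, labels) if lab == j) for j in range(2)]
best, scores = classify("goal team match stock".split(), vocab, labels, k=2)
print(clusters)
print(best, scores)
```

In a distributed setting the assignment and centroid-update steps would run as parallel jobs over the word vectors. The sketch only shows the logic that such jobs would compute.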
