摘要
【目的】通过开源工具,构建一种分布式环境下的文本聚类与分类应用平台。【方法】以海量文本的词收敛性为基础,通过词聚类指导文本聚类和分类。过程包括:使用开源分词器等工具进行训练集的文本预处理,结合Mahout数据挖掘平台对处理后的词集进行聚类分析,最后通过相似度算法计算测试文本与词类簇的相似度并分类。【结果】分布式环境下的基于词聚类的文本聚类分类计算方法,可有效解决海量文本的词聚类瓶颈问题。经测试,当训练文本集增加到100,迭代收敛阈值为0.01时,词聚类结果较理想。【局限】测试数据规模有限,仅限于新闻数据,基于其他领域的词聚类效果需要进一步测试、优化、调整。【结论】详细描述基于词聚类的文本聚类分类算法的开发环境构架和关键步骤,有助于研究者对相关开源工具使用及分布式并行环境部署的深入理解。
[Objective] To implement the textual clustering and classification in distributed environment through open-source tools. [Methods] According to the convergence of words in masses of text, this paper classifies texts based on word-clustering, including text preprocess by open-source tokenizer, cluster analysis by Mahout, classifying the test text by computing the similarity between the text and word-cluster. [Results] The textual clustering based on word-clustering in distributed environment effectively solves the bottleneck of word-clustering of massive texts. The tested result of word-clustering is ideal while the number of text training set exceeds 100 and the iterative convergence threshold is 0.01. [Limitations] The data type is limited in the field of news and the other field-based word-clustering also needs further test, optimization and adjustment. [Conclusions] This study describes the build process and key steps of the textual clustering and classification in distributed environment to help readers with in-der)th understood.
出处
《现代图书情报技术》
CSSCI
2015年第1期82-88,共7页
New Technology of Library and Information Service