Abstract
Text clustering, an important research branch of text mining, applies clustering methods to text processing. This paper first discusses and summarizes, in some depth, the text clustering process based on the vector space model (VSM). It then reviews existing text clustering algorithms and the metrics commonly used to evaluate clustering quality. Building on this prior work, the paper uses the 20 Newsgroups corpus and the vector space representation to implement text preprocessing and the k-means clustering algorithm on WEKA, an open-source data mining platform. Based on the observed clustering results, it proposes optimizations in text representation, feature selection, and feature dimensionality reduction.
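The pipeline named in the abstract (preprocessing, vector space representation, k-means) can be illustrated with a minimal sketch. This is not the paper's WEKA implementation: it is a standard-library Python example under stated assumptions. The four-sentence toy corpus, the stop-word list, the smoothed IDF formula, and the function names `tokenize`, `tfidf_vectors`, and `kmeans` are all hypothetical choices made for illustration, and the centroids are seeded with the first k documents only to keep the result deterministic.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy corpus (not the 20 Newsgroups data).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}
corpus = [
    "The space shuttle launch reached orbit",
    "The hockey team won the game",
    "NASA planned an orbit launch mission",
    "A hockey player scored in the game",
]

def tokenize(text):
    """Preprocessing: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Map tokenized documents to L2-normalized TF-IDF vectors (VSM)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Assumed weighting: tf * (log(n / df) + 1); other variants exist.
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def kmeans(vectors, k, max_iters=20):
    """Plain k-means with squared Euclidean distance.

    Centroids are seeded with the first k vectors (an assumption made
    for reproducibility; random seeding is more common in practice).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

labels = kmeans(tfidf_vectors([tokenize(d) for d in corpus]), k=2)
print(labels)  # [0, 1, 0, 1]
```

On this toy input the two space-related sentences share a cluster and the two hockey sentences share the other, because each pair overlaps in vocabulary ("launch", "orbit" versus "hockey", "game") while the pairs share no terms with each other.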
Source
China Management Informationization (《中国管理信息化》)
2009, Issue 21, pp. 9-12 (4 pages)