Abstract
In text clustering, the high dimensionality of text feature vectors makes it very difficult to evaluate the statistical characteristics of the samples, so effective dimensionality reduction is necessary. ISOMAP is a recently introduced nonlinear dimensionality reduction method that can effectively reduce the dimensionality of the text feature space. It improves the distance measure between sample vectors by replacing the traditional Euclidean distance with the geodesic distance, and maps high-dimensional text feature data into a low-dimensional (2 or 3 dimensions) visualization space. This reduces the dimensionality of the data, makes the text feature data visualizable, and to some extent addresses the problem of choosing the number of clusters. Finally, an example verifies the feasibility of the method.
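The replacement of Euclidean distance by geodesic distance can be illustrated with a small sketch. It is not the paper's implementation: the toy data (points on a semicircle rather than text feature vectors), the function names `euclidean` and `geodesic_distances`, and the neighbourhood size `k` are all assumptions made for illustration. The sketch approximates geodesic distances the way ISOMAP is usually described, by linking each point to its nearest neighbours and taking shortest paths through that graph. A full ISOMAP would then apply classical multidimensional scaling to these distances to obtain the 2 or 3 dimensional coordinates.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geodesic_distances(points, k):
    """Approximate geodesic distances via a k-nearest-neighbour graph.

    Hypothetical helper for illustration: builds a symmetric kNN graph
    weighted by Euclidean distance, then runs Floyd-Warshall to get
    all-pairs shortest path lengths.
    """
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # k nearest neighbours of point i (excluding i itself)
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )[:k]
        for j in neighbours:
            d = euclidean(points[i], points[j])
            dist[i][j] = d
            dist[j][i] = d  # keep the graph symmetric
    # Floyd-Warshall: relax every pair through every intermediate node
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Toy data (an assumption): 21 points evenly spaced on a unit semicircle.
points = [
    (math.cos(math.pi * t / 20), math.sin(math.pi * t / 20))
    for t in range(21)
]

geo = geodesic_distances(points, k=2)
straight = euclidean(points[0], points[-1])  # chord across the semicircle
along_curve = geo[0][-1]                     # path that follows the curve
```

Here `straight` is the diameter of the semicircle, while `along_curve` comes out close to the arc length, which is the distance that reflects how far apart the two end points are along the data manifold. With real text features the same idea applies to document vectors, though the choice of `k` then matters: too small a value can leave the graph disconnected, in which case some entries stay infinite.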
Source
Microcomputer Applications (《微型电脑应用》)
2009, No. 8, pp. 25-26, 29 (3 pages)
Keywords
Text clustering
ISOMAP (isometric feature mapping)
Dimension reduction
Data visualization