Abstract
This paper studies the classification of text data. Data samples were extracted from text files and weighted using the vector space model (VSM) and TF-IDF weighting to obtain text similarity matrices. Clustering experiments were then carried out with the K-Means algorithm under different sample distance calculation methods, and the clustering results were analysed and summarised. Based on the experimental conclusions, the differences among the distance calculation methods and the data types each is suited to were studied.
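The pipeline named in the abstract (TF-IDF weighting in a vector space model, followed by K-Means under different distance measures) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy documents, the function names `tfidf_vectors`, `euclidean`, `cosine_distance` and `kmeans`, the evenly spaced centroid initialisation, and the choice of Euclidean and cosine distance as the two measures compared are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenised documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] / len(doc) * idf[term] for term in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(vectors, k, distance, iterations=20):
    """Basic K-Means with a pluggable distance function.

    Assumption: centroids start at evenly spaced input vectors so the
    result is deterministic; real experiments usually use random restarts.
    """
    n = len(vectors)
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: nearest centroid under the chosen distance.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two documents about pets, two about markets.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
]
vocab, vectors = tfidf_vectors(docs)
labels_euclidean = kmeans(vectors, 2, euclidean)
labels_cosine = kmeans(vectors, 2, cosine_distance)
print(labels_euclidean, labels_cosine)
```

Passing the distance function as a parameter keeps the clustering loop identical across measures, so any difference in the resulting labels comes from the measure alone. Note that averaging members to update centroids is the standard choice for Euclidean distance; with cosine distance it is only an approximation used here to keep the sketch short.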
Source
《福建工程学院学报》
CAS
2016, No. 1, pp. 80-85 (6 pages)
Journal of Fujian University of Technology