Abstract
KNN (K-Nearest Neighbor) is one of the best text classification algorithms in the vector space model. However, when the sample set is large and the text vectors are high-dimensional, both the efficiency and the accuracy of KNN classification drop sharply. This paper proposes an improved algorithm that raises the classification efficiency of KNN, together with a revised similarity measure that judges high-dimensional text vectors from large sample sets more accurately. During training, the algorithm computes the distribution range of each class of documents in the vector space. During classification, it uses the position of the document to be classified in the sample space to narrow the search range for its K nearest neighbors. Experiments show that the improved algorithm significantly reduces classification time while keeping classification performance essentially the same as that of the traditional KNN algorithm.
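The abstract does not say how the per-class distribution range is represented or what the revised similarity measure is, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each class region is a centroid plus the lowest cosine similarity between that centroid and any class member, and it uses ordinary cosine similarity rather than the paper's revised measure. The names `train_regions` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Plain cosine similarity (not the paper's revised measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def train_regions(samples):
    """Assumed region model: per class, (centroid, lowest member-to-centroid similarity)."""
    by_class = defaultdict(list)
    for vec, label in samples:
        by_class[label].append(vec)
    regions = {}
    for label, vecs in by_class.items():
        dim = len(vecs[0])
        centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        min_sim = min(cosine(v, centroid) for v in vecs)
        regions[label] = (centroid, min_sim)
    return regions


def classify(query, samples, regions, k=3):
    """KNN vote restricted to classes whose assumed region contains the query."""
    candidates = {
        label
        for label, (centroid, min_sim) in regions.items()
        if cosine(query, centroid) >= min_sim
    }
    if not candidates:
        # Query lies outside every region: fall back to a full KNN search.
        candidates = set(regions)
    pool = [(cosine(query, vec), label) for vec, label in samples if label in candidates]
    pool.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in pool[:k])
    return votes.most_common(1)[0][0]


# Toy term-weight vectors, made up for illustration only.
samples = [
    ([1.0, 0.0, 0.0], "sports"),
    ([0.9, 0.1, 0.0], "sports"),
    ([0.8, 0.2, 0.0], "sports"),
    ([0.0, 1.0, 0.0], "tech"),
    ([0.0, 0.9, 0.1], "tech"),
    ([0.1, 0.8, 0.0], "tech"),
]
regions = train_regions(samples)
print(classify([1.0, 0.05, 0.0], samples, regions))
```

In this sketch the saving comes from comparing the query against one centroid per class first, so that the more expensive neighbor search only touches samples from classes whose region the query falls into. How much accuracy is retained depends on how tight the regions are, which the abstract does not quantify.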
Authors
余悦蒙
黄小斌
YU Yue-meng, HUANG Xiao-bin (School of Information Science and Engineering, Xiamen University, Xiamen 361005, China)
Source
《电脑知识与技术》
2012, No. 3, pp. 1564-1566 (3 pages in total)
Computer Knowledge and Technology