Abstract
With the growth of the Web, large numbers of documents have become available online, and automatic text categorization has become a key technique for handling massive amounts of data. Among the many text categorization algorithms, kNN has been shown to be one of the best. For kNN and most other classifiers, however, text preprocessing is the bottleneck, and its quality directly affects categorization performance. This paper presents a new text preprocessing algorithm based on the Gini index. It also adopts fuzzy set theory to improve the decision rule of kNN. Combining the two gives a fuzzy kNN classifier that shows better categorization performance than classical kNN. Experimental results show that the improvement is effective and feasible.
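The abstract does not give the formulas, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes a Gini-style purity score, the sum over classes of P(class | term) squared, for choosing terms during preprocessing, and a similarity-weighted membership vote as the fuzzy decision rule. The function names, the 0.9 threshold, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def gini_scores(docs, labels):
    """Assumed purity score: sum over classes of P(class | term)^2.
    A score of 1.0 means the term occurs in documents of one class only."""
    term_class = defaultdict(Counter)  # term -> class -> number of docs containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[term][label] += 1
    scores = {}
    for term, counts in term_class.items():
        total = sum(counts.values())
        scores[term] = sum((n / total) ** 2 for n in counts.values())
    return scores

def vectorize(tokens, vocab):
    """Term-frequency vector restricted to the selected vocabulary."""
    return {t: float(c) for t, c in Counter(tokens).items() if t in vocab}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fuzzy_knn(query, train_vecs, labels, k=3):
    """Return a class -> membership dict (values sum to 1) instead of a hard vote.
    Each of the k nearest neighbours contributes its cosine similarity."""
    pairs = [(cosine(query, v), y) for v, y in zip(train_vecs, labels)]
    nearest = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in nearest)
    if total == 0:
        return {}  # no neighbour shares any term with the query
    member = defaultdict(float)
    for s, y in nearest:
        member[y] += s
    return {c: m / total for c, m in member.items()}

# Toy corpus (hypothetical): "news" appears in both classes, so it scores low.
docs = [
    ["goal", "match", "team", "news"],
    ["team", "score", "goal"],
    ["stock", "market", "price", "news"],
    ["market", "trade", "price"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = gini_scores(docs, labels)
vocab = {t for t, s in scores.items() if s >= 0.9}  # threshold is an assumption
train_vecs = [vectorize(d, vocab) for d in docs]
memberships = fuzzy_knn(vectorize(["goal", "team", "news"], vocab), train_vecs, labels)
predicted = max(memberships, key=memberships.get)
```

Unlike a majority vote, the returned memberships keep the degree of support for every class among the neighbours, which is the sense in which the decision rule is "fuzzy" in this sketch.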
Source
Journal of Guangxi Normal University: Natural Science Edition (《广西师范大学学报(自然科学版)》)
Indexed in: CAS, PKU Core Journals (北大核心)
2006, No. 4, pp. 87-90 (4 pages)
Funding
National Natural Science Foundation of China (60503017)
Beijing Jiaotong University Science Foundation (2004RC008)