Abstract
Feature selection and the classification algorithm are two key techniques in text classification. This paper proposes a text classification method that combines principal component analysis (PCA) with the k-nearest neighbor (KNN) algorithm. The method uses PCA to select features from the high-dimensional space of text vectors. To overcome the adverse effects of improper category feature selection, it then classifies with KNN, which keeps the error introduced during classification as small as possible. To verify the method's effectiveness, simulation experiments were run on UCI standard data sets. The results show that the PCA-KNN method outperforms the method that combines PCA with random forests and improves text classification accuracy to some extent.
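The abstract gives no implementation details, so the following is only a minimal sketch of the general PCA-then-KNN idea, not the authors' code. It assumes documents are already term-count vectors, projects them onto the leading principal components (strictly a projection rather than a selection of original features), and classifies by majority vote among the nearest training documents. The toy data, the labels, the function names, and the power-iteration eigen-solver are all illustrative assumptions.

```python
import math
from collections import Counter

def fit_pca(X, n_components, iters=200):
    """Return (mean, components) using power iteration with deflation.

    Assumption: the fixed start vector is not orthogonal to the leading
    eigenvector, which holds for the toy data below but not in general.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
         for i in range(d)]
    components = []
    for _ in range(n_components):
        v = [float(j + 1) for j in range(d)]  # deterministic start vector
        for _step in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d))
                  for i in range(d))
        components.append(v)
        # Deflate: remove the component just found from C.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)]
             for i in range(d)]
    return mean, components

def project(x, mean, components):
    """Coordinates of x in the principal-component basis."""
    return [sum((x[j] - mean[j]) * c[j] for j in range(len(x)))
            for c in components]

def knn_predict(train_Z, train_y, z, k=3):
    """Majority label among the k nearest training points (Euclidean)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = sorted(range(len(train_Z)),
                     key=lambda i: sq_dist(train_Z[i], z))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Hypothetical toy corpus: rows are term counts over a 4-word vocabulary.
train_X = [[3, 0, 1, 0], [2, 1, 0, 0], [3, 1, 1, 0],
           [0, 3, 0, 2], [0, 2, 1, 3], [1, 3, 0, 2]]
train_y = ["sports", "sports", "sports", "tech", "tech", "tech"]

mean, components = fit_pca(train_X, n_components=2)
train_Z = [project(x, mean, components) for x in train_X]

def predict(doc_vector, k=3):
    """Reduce a new document with the fitted PCA, then classify by KNN."""
    return knn_predict(train_Z, train_y, project(doc_vector, mean, components), k)

print(predict([2, 0, 1, 0]), predict([0, 3, 1, 2]))
```

In practice a library implementation of PCA and KNN would normally replace the hand-written routines. The number of components and the neighbor count `k` are tuning choices that the abstract does not specify.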
Source
Journal of Northeast Electric Power University (《东北电力大学学报》)
2013, No. 6, pp. 60-63 (4 pages)
Funding
National Natural Science Foundation of China (11226263, 11201057, 61202261)
Natural Science Foundation of Jilin Province (201215165)