Abstract
To improve both the accuracy and the execution efficiency of text classification, this paper studies several feature extraction methods for Web text. Drawing on the complementary strengths of mutual information (MI), document frequency (DF), information gain (IG), and the χ² statistic (CHI), it proposes a multiple combined feature extraction algorithm based on principal component analysis (PCA), called PCA-CFEA. The algorithm first applies the orthogonal transformation of PCA to quickly reduce the dimensionality of the text feature space. It then applies the multiple combined feature extraction algorithm in the reduced feature space to quickly extract the more representative feature items and filter out the less representative ones. Finally, an SVM classifier is used to classify the text. Experimental results show that PCA-CFEA effectively improves both the accuracy and the execution efficiency of text classification.
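The four scoring measures named in the abstract can be illustrated with a small sketch. The abstract does not say how PCA-CFEA combines them, so the rank-sum rule in `combined_ranking` below is purely an assumption for illustration, as are the function names and the toy document counts. The PCA and SVM stages are omitted. Each term is scored against one class using a 2x2 table of document counts.

```python
import math


def entropy(*counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k)


def term_scores(a, b, c, d):
    """Score one term against one class from document counts.

    a: in-class docs containing the term   b: out-of-class docs containing it
    c: in-class docs without the term      d: out-of-class docs without it
    """
    n = a + b + c + d
    df = a + b  # document frequency
    # Pointwise mutual information: log P(t, c) / (P(t) P(c))
    mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")
    # Chi-square statistic for the 2x2 table
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Information gain: H(class) - H(class | term present/absent)
    ig = entropy(a + c, b + d) - (
        (a + b) / n * entropy(a, b) + (c + d) / n * entropy(c, d)
    )
    return {"DF": df, "MI": mi, "IG": ig, "CHI": chi}


def combined_ranking(tables):
    """Order terms by the sum of their ranks under the four measures.

    Hypothetical combination rule; the paper's actual rule is not given
    in the abstract.
    """
    scores = {t: term_scores(*cells) for t, cells in tables.items()}
    total_rank = {t: 0 for t in tables}
    for metric in ("DF", "MI", "IG", "CHI"):
        ordered = sorted(tables, key=lambda t: scores[t][metric], reverse=True)
        for rank, t in enumerate(ordered):
            total_rank[t] += rank
    return sorted(tables, key=lambda t: total_rank[t])


# Toy counts (invented): 100 documents, 50 in the target class.
tables = {
    "goal": (40, 5, 10, 45),  # frequent and concentrated in the class
    "the": (50, 50, 0, 0),    # appears in every document
    "rare": (1, 0, 49, 50),   # appears in a single in-class document
}
print(combined_ranking(tables))
```

The toy terms show why the measures complement each other: DF alone favours the uninformative term that appears everywhere, while MI alone favours the term seen only once, so a combined score can offset each measure's bias.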
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2013, No. 8, pp. 2398-2401 (4 pages)
Application Research of Computers
Funding
Jiangsu Province 2010 Qing Lan Project, funding program for key teachers (苏教2010-16)