Abstract
Feature dimension reduction is an important step in text categorization. To improve the accuracy of feature dimension reduction, select feature words that effectively distinguish text categories, and thereby improve classification performance, this paper proposes a feature dimension reduction method that combines the inter-class concentration of texts, the intra-class dispersion of texts, and the inter-class concentration of word frequency. To obtain an overall assessment of a feature word over the text set, it also proposes a new global assessment function that takes the difference between the largest and second-largest values as the final assessment value. Experiments comparing this method with traditional feature dimension reduction methods show that it achieves better dimension reduction in Chinese text categorization.
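The global assessment step described in the abstract can be illustrated with a minimal sketch. It assumes that each feature word already has one score per class, and that the "largest" and "second-largest" values are taken over those per-class scores; the abstract does not spell this out. The function name `global_score`, the example words, and the numeric scores are hypothetical, and the per-class scoring formula itself is not reproduced here.

```python
def global_score(per_class_scores):
    """Return the largest per-class score minus the second largest.

    Assumption: per_class_scores holds one score per class for a single
    feature word. How those scores are computed is not shown here.
    """
    if len(per_class_scores) < 2:
        raise ValueError("need scores for at least two classes")
    top, second = sorted(per_class_scores, reverse=True)[:2]
    return top - second


# Hypothetical per-class scores for two words over three classes.
scores = {
    "discriminative_word": [0.9, 0.2, 0.1],  # strong in one class only
    "generic_word": [0.8, 0.8, 0.7],         # similar in every class
}

# Rank words so that those with the largest gap come first.
ranked = sorted(scores, key=lambda w: global_score(scores[w]), reverse=True)
```

Under this reading, a word that scores highly in several classes gets a small global value even if its maximum is large, so it ranks below a word that stands out in a single class.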
Source
《计算机应用研究》
CSCD
PKU Core Journal
2012, No. 7, pp. 2541-2543 (3 pages)
Application Research of Computers
Keywords
text categorization
feature dimension reduction
concentration
dispersion
assessment function