Abstract
Feature dimensionality reduction has long been an important research topic in text classification, and designing efficient feature selection methods is an effective way to achieve it. Existing feature selection methods often mistakenly remove features with strong class-discriminating power while retaining weakly discriminating ones. To address this, the paper proposes an effective feature dimensionality reduction strategy. The method first defines and quantifies features, builds a single-source feature retention set, and removes the features common to all classes; it then adjusts the weights of multi-source features, thereby reducing the feature set and improving classification performance. Comparative experiments on the Reuters-21578 and NewsGroup corpora show that the new reduction strategy is effective and feasible.
Source
Journal of Guizhou Education University (《贵州师范学院学报》), 2012, No. 6, pp. 6-10 (5 pages)
Keywords
text categorization
single-source feature
multi-source feature
feature dimensionality reduction
feature selection