Abstract
The chi-square test (CHI) is a widely used feature selection method in text classification. It considers only the association between terms and categories and ignores the associations among terms, so the selected feature set is highly redundant. This paper defines the concept of "residual mutual information" for terms and proposes a method that uses it to optimize the features selected by CHI. The resulting feature set is both strongly representative and highly independent. Experiments indicate that the method performs well.
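To make the redundancy problem concrete, here is a minimal Python sketch of the two standard quantities involved: the chi-square statistic between a term and a category, and the ordinary mutual information between two terms. It is an illustration under stated assumptions, not the paper's method. The abstract does not give the definition of "residual mutual information", so it is not reproduced here. The function names and the document-count arguments are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def term_mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between two binary term-occurrence variables.

    n11: documents containing both terms
    n10: documents containing only the first term
    n01: documents containing only the second term
    n00: documents containing neither term
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for count, row_total, col_total in cells:
        if count > 0:  # empty cells contribute nothing to the sum
            mi += (count / n) * math.log2(count * n / (row_total * col_total))
    return mi


# Two terms that occur in exactly the same documents receive the same
# chi-square score for a category, so CHI alone would keep both of them.
score = chi_square(10, 0, 0, 10)
overlap = term_mutual_information(10, 0, 0, 10)
```

With these counts, a term that perfectly separates a 20-document collection gets a high chi-square score, and a second term with the same occurrence pattern would get the same score. The high mutual information between the two terms is the kind of term-to-term dependence that a score based only on term-category association cannot see.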
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2015, No. 5, pp. 54-56, 77 (4 pages)
Computer Science
Funding
Supported by the Doctoral Program Foundation of the Ministry of Education of China (Grant No. 2010081110053)
Keywords
Text categorization
Feature selection
Chi-square test (CHI)
Mutual information