Abstract
Feature selection (FS) plays an important role in text categorization (TC), the process of assigning texts to one or more predefined categories based on their content. With the increased availability of documents in digital form and the rapid growth of online information, TC has become a key technique for handling and organizing text data. Many FS methods are widely used in TC, such as document frequency thresholding (DF), information gain (IG) and mutual information (MI). Existing empirical studies show that IG is among the most effective FS methods, that DF performs slightly worse, and that MI performs comparatively poorly. A basic research question is why these FS methods differ in performance. Existing work evaluates FS functions purely empirically, through experiments. This paper instead proposes a qualitative, theoretical method for evaluating FS functions in TC. A set of basic constraints related to category information, which any reasonable FS function should satisfy, is defined, and IG, DF and MI are checked against them. The analysis finds that IG satisfies the constraints fully, that DF does not satisfy them fully but is strongly statistically correlated with them, and that MI conflicts with them. Experimental results on the Reuters-21578 and OHSUMED corpora show that the empirical performance of an FS method is tightly related to how well it satisfies these constraints.
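For readers unfamiliar with the three FS functions named in the abstract, the following is a minimal sketch of their commonly used textbook definitions for a single term and a single binary category. It is not the paper's evaluation method and does not encode the paper's constraints. The function names and the 2x2 count layout (`a`, `b`, `c`, `d`) are assumptions chosen for illustration, and MI is taken here to be the pointwise form.

```python
import math

# Hypothetical 2x2 document counts for one term t and one category c:
#   a: documents in c that contain t      b: documents outside c that contain t
#   c: documents in c without t           d: documents outside c without t


def document_frequency(a, b, c, d):
    """DF: number of documents that contain the term."""
    return a + b


def mutual_information(a, b, c, d):
    """Pointwise MI between 'term present' and 'document in category', in bits."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log2(a * n / ((a + b) * (a + c)))


def information_gain(a, b, c, d):
    """IG for a binary category: expected MI over term presence and absence."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),  # term present, in category
        (b, a + b, b + d),  # term present, outside category
        (c, c + d, a + c),  # term absent, in category
        (d, c + d, b + d),  # term absent, outside category
    )
    total = 0.0
    for joint, term_total, cat_total in cells:
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(joint * n / (term_total * cat_total))
    return total


# Two illustrative terms in a made-up collection of 1000 documents.
rare_term = (1, 0, 99, 900)      # occurs once, inside the category
common_term = (80, 20, 20, 880)  # occurs often, mostly inside the category

if __name__ == "__main__":
    for name, counts in (("rare", rare_term), ("common", common_term)):
        print(name,
              document_frequency(*counts),
              round(mutual_information(*counts), 3),
              round(information_gain(*counts), 3))
```

On these made-up counts, pointwise MI scores the term seen in a single document above the frequent, strongly category-associated term, whereas IG and DF rank them the other way. This sensitivity of pointwise MI to rare terms is a well-known property and is consistent with the performance ordering reported above, although the abstract itself does not state the cause.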
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University core journal list)
2008, Issue 4, pp. 596-602 (7 pages)
Journal of Computer Research and Development
Funding
National Natural Science Foundation of China (60473002, 60603094)
Beijing Natural Science Foundation (4051004)
Keywords
feature selection
text categorization
information retrieval
information gain
mutual information