Abstract
This paper briefly analyzes the word frequency and document frequency methods and, after summarizing their deficiencies, proposes a category correlation method. It then examines the shortcomings of information gain in ID3 and introduces attribute dependence to address them. Exploiting the way information gain is computed, it further simplifies the calculation using properties of convex functions, which reduces the computational cost and improves efficiency. Finally, the optimized ID3 is combined with the category correlation method to form a comprehensive feature selection method. This method first applies the category correlation method for preliminary feature selection, reducing the sparsity of the text vectors, and then applies the optimized ID3 to select features further, yielding a more representative feature subset. Experimental results show that the method performs well.
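For orientation, the information gain criterion that ID3 uses to rank features can be sketched as follows. This is a minimal illustration of the standard, unoptimized formula only. It does not reproduce the paper's category correlation method, its attribute dependence correction, or its convex function simplification, none of which are specified in the abstract. The function names and the representation of each document as a set of tokens are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """Standard information gain of the binary feature 'term occurs in the document'.

    docs is a list of token sets (an assumed representation), labels the class of each document.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy after splitting on the term, weighted by split size.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Hypothetical helper: keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return ranked[:k]
```

In a two-stage scheme like the one the abstract describes, a function such as `select_top_k` would run only on the terms that survive the preliminary selection, so the entropy computation is applied to a smaller vocabulary.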
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
CSCD
PKU Core Journal (北大核心)
2011, No. 2, pp. 230-234 (5 pages)
Funding
Supported by the Basic and Frontier Technology Research Program of Henan Province (102300410266)
Keywords
text categorization
information gain
attribute dependence