Abstract
With the rapid growth of electronic documents on the web, text classification is becoming increasingly important in information retrieval. As the feature dimensionality grows, the statistical characteristics of the samples become harder to estimate, which reduces the classifier's generalization ability and leads to overfitting. Feature selection and extraction are therefore a necessary step before text classification. This paper proposes a feature selection method based on class information: it takes the class information of the training documents into account while retaining as much document information as possible. Experiments show that the method achieves good classification performance; it improves substantially over OCFS and the chi-square statistic (CHI) on the micro-averaged measure, and it is consistently better than both on macro-averaged F1.
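The abstract does not spell out the proposed method, so the sketch below illustrates only the chi-square (CHI) baseline it is compared against. It assumes each document is a set of terms with a single class label, and it scores a term by its maximum chi-square value over the classes. The function names `chi_square` and `select_top_k` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_top_k(docs, labels, k):
    """Rank terms by their maximum chi-square score over all classes.

    Assumption for this sketch: `docs` is a list of term sets and
    `labels` is a parallel list of class labels.
    Returns (top-k terms, dict of all term scores).
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    term_df = Counter()        # document frequency per term
    term_class_df = Counter()  # document frequency per (term, class)
    for terms, label in zip(docs, labels):
        for t in set(terms):
            term_df[t] += 1
            term_class_df[(t, label)] += 1

    scores = {}
    for t, df in term_df.items():
        best = 0.0
        for c, size in class_sizes.items():
            n11 = term_class_df[(t, c)]
            n10 = df - n11
            n01 = size - n11
            n00 = n_docs - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[t] = best

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores
```

Taking the maximum over classes is one common way to reduce per-class scores to a single value per term; a class-weighted average is another, and the record does not say which variant the paper used.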
Source
《山东大学学报(理学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2006, Issue 3, pp. 10-13, 59 (5 pages)
Journal of Shandong University (Natural Science)
Funding
Key Science and Technology Project of the Ministry of Education (03070)
Natural Science Foundation of Jiangxi Province (0311041)
Keywords
feature selection
text classification
between-class distribution
within-class distribution