Abstract
Traditional feature selection methods evaluate feature importance from the perspective of the whole corpus and may therefore ignore some features that are important for categorization. To address this problem, a two-step feature selection method is proposed. First, features that are not strongly related to any category are filtered out. Second, words are assigned to categories as category discriminating words according to their statistical information, and the optimal subset of categorization features is found for each category. Finally, the optimal subsets of all categories are combined to form the final categorization features. In the experiments, a naive Bayesian classifier is used, and the proposed method is compared with the traditional method using five feature selection formulas: IG, ECE, CC, MI and CHI. The Macro-F1 values obtained with these formulas are 91.075%, 91.122%, 91.160%, 90.253% and 90.881% for the proposed method, versus 86.971%, 86.992%, 87.470%, 86.061% and 87.006% for the traditional method. While taking categorization feature information into account, the method preserves as far as possible the good features selected by traditional methods and captures categorization information better.
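The three stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how words are assigned to categories or how each optimal subset is found, so the details here are assumptions made only for illustration: a one-sided CHI score (zero when the word is negatively associated with the category), assignment of each word to its highest-scoring category, and a top-k cut per category. The names `two_step_select`, `min_score` and `per_class_k` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """One-sided CHI score from a 2x2 term/category table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    Assumption: negative associations (a*d <= b*c) score 0.
    """
    if a * d <= b * c:
        return 0.0
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def two_step_select(docs, labels, min_score=1.0, per_class_k=2):
    """Return (final feature set, per-category feature lists)."""
    n_docs = len(docs)
    class_count = Counter(labels)
    df = Counter()                   # document frequency of each term
    df_class = defaultdict(Counter)  # document frequency within each category
    for words, label in zip(docs, labels):
        for w in set(words):
            df[w] += 1
            df_class[label][w] += 1

    candidates = defaultdict(list)   # category -> [(term, score), ...]
    for w in df:
        scores = {}
        for cat, n_cat in class_count.items():
            a = df_class[cat][w]
            b = df[w] - a
            c = n_cat - a
            d = n_docs - n_cat - b
            scores[cat] = chi_square(a, b, c, d)
        best_cat = max(scores, key=scores.get)
        # Step 1: drop terms weakly related to every category.
        if scores[best_cat] < min_score:
            continue
        # Step 2a: treat the term as a discriminating word of its best category.
        candidates[best_cat].append((w, scores[best_cat]))

    # Step 2b: keep the top-k discriminating words of each category
    # (a stand-in for the paper's optimal-subset search).
    per_class = {cat: [] for cat in class_count}
    for cat, items in candidates.items():
        ranked = sorted(items, key=lambda t: -t[1])
        per_class[cat] = [w for w, _ in ranked[:per_class_k]]

    # Step 3: combine the per-category subsets into the final feature set.
    final = set()
    for words in per_class.values():
        final.update(words)
    return final, per_class
```

Selecting per category rather than by a single corpus-wide ranking means that a category whose best words score modestly overall still contributes features, which is the motivation the abstract gives for the method.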
Source
Computer Aided Engineering (《计算机辅助工程》)
2008, No. 3, pp. 76-80 (5 pages)
Funding
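National Natural Science Foundation of China (60703010)
Natural Science Foundation of Chongqing (2006BB2374)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ070519)
Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education (教外司留[2007]1109号)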
Keywords
two-step feature selection
Chinese text categorization
category discriminating word
naive Bayesian