Abstract
Feature extraction is an essential step in text categorization, and its quality directly affects categorization accuracy. Building on a study of feature extraction methods for text categorization, this paper analyzes the shortcomings of the χ² statistic (CHI) and proposes an improved CHI approach that takes frequency, concentration, and distribution into account. A comparative experiment was carried out to measure how the approach performs before and after the improvement. In the experiment, the improved CHI approach gave better categorization results than the traditional CHI approach, which supports the effectiveness and feasibility of the improved approach.
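For orientation, the traditional χ² statistic that the paper takes as its baseline is conventionally computed from a 2×2 table of document counts for a term and a category. The sketch below shows that conventional form only; the abstract does not give the improved formula, so the frequency, concentration, and distribution factors are not reproduced here. The function names `chi_square` and `count_cells` and the toy corpus are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Set, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Conventional chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term never occurs): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def count_cells(
    docs: Iterable[Set[str]], labels: Iterable[str], term: str, category: str
) -> Tuple[int, int, int, int]:
    """Build the 2x2 document-count table for one (term, category) pair."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical toy corpus: each document is reduced to its set of tokens.
toy_docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"law", "tax"}]
toy_labels = ["sports", "sports", "politics", "politics"]

cells = count_cells(toy_docs, toy_labels, "ball", "sports")
score = chi_square(*cells)
```

Because this baseline uses only document counts, it records whether a term appears in a document but not how often it appears there.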
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2008, Issue 2, pp. 513-514, 518 (3 pages)
Funding
Supported by the Natural Science Foundation of the Chongqing Science and Technology Commission (CSTC2006BB2021)
Keywords
feature extraction
χ² statistic (CHI approach)
frequency
concentration
distribution