Abstract
In feature selection, the weight assigned to a term by the evaluation function determines whether the term is selected as a feature. This weight is affected by several factors, chiefly the term's importance, distinctiveness, and representativeness within a class. Taking these factors into account, a new feature selection function, TW (Term Weight), is constructed from a feature's importance and its ability to discriminate between categories. Comparative experiments on terms' chi-square statistic (CHI), information gain (IG), and TW show that TW raises the weight of class-specific vocabulary while lowering the weight of common terms that contribute little to classification. Finally, using TW as a feature selection algorithm, classification experiments on a Chinese classification corpus with three classifiers, KNN, class centroid, and support vector machine (SVM), together with comparisons against other feature selection algorithms, confirm its effectiveness.
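The two baselines the abstract compares against, CHI and IG, both score a term t for a class c from a 2×2 contingency table of document counts. A minimal sketch of those baseline scores (the function names and example counts are illustrative; the paper's TW function is not specified here and is not implemented):

```python
import math

def chi_square(A, B, C, D):
    """CHI statistic of term t for class c.
    A: docs in c containing t;  B: docs outside c containing t;
    C: docs in c without t;     D: docs outside c without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

def information_gain(A, B, C, D):
    """IG of term t over the two-way split (class c vs. not-c):
    entropy of the class prior minus entropy conditioned on t's presence."""
    N = A + B + C + D

    def h(probs):  # Shannon entropy in bits
        return -sum(p * math.log2(p) for p in probs if p > 0)

    prior = h([(A + C) / N, (B + D) / N])
    p_t = (A + B) / N
    cond = 0.0
    if A + B > 0:  # docs containing t
        cond += p_t * h([A / (A + B), B / (A + B)])
    if C + D > 0:  # docs without t
        cond += (1 - p_t) * h([C / (C + D), D / (C + D)])
    return prior - cond

# A class-specific term scores high; a uniformly distributed term scores zero.
print(chi_square(40, 10, 10, 40))   # strongly class-associated term
print(chi_square(25, 25, 25, 25))   # term independent of the class
```

Both scores are zero exactly when the term's presence is independent of the class, which is why, as the abstract notes, frequent but class-neutral words can still receive low scores.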
Source
《现代图书情报技术》 (New Technology of Library and Information Service)
CSSCI; Peking University Core Journal (北大核心)
2013, No. 5, pp. 34-39 (6 pages)
Funding
Supported by the National High-tech R&D Program of China (863 Program) project "Multi-source Information Sensing Technology and Product Development for the Full Agricultural Product Supply Chain" (Grant No. 2012AA101701).
Keywords
Text categorization; Feature selection; Class discrimination; TF-IDF