Abstract
Feature selection is a key research topic in automatic Chinese text categorization; its purpose is to resolve the tension between the high dimensionality of the feature space and the sparsity of document representation vectors. To address the relatively poor classification performance of the mutual information (MI) feature selection method, an improved mutual information feature selection method, IMI, is proposed. The method takes into account the frequency of a feature term in the current text, and it also handles feature selection when the mutual information value is negative, so that low-frequency words are filtered out more effectively. Experiments with a KNN classifier show that the improved method greatly improves classification accuracy.
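The abstract does not give the IMI formula, so the following is only a minimal sketch of the general idea, not the authors' method. `mi_scores` computes the standard document-level mutual information, MI(t, c) = log(P(t, c) / (P(t) P(c))). `weighted_scores` is a hypothetical variant that assumes one possible way of combining the two ideas named in the abstract: it multiplies the absolute MI value (so negative values still count as evidence) by a log-scaled term frequency (so terms seen only once are demoted). All function names and the weighting `log(1 + tf)` are assumptions made for illustration.

```python
import math
from collections import Counter


def mi_scores(docs):
    """Standard MI per (term, label) pair that co-occurs at least once.

    docs: list of (tokens, label) pairs, where tokens is a list of strings.
    """
    n = len(docs)
    class_docs = Counter(label for _, label in docs)
    term_docs = Counter()   # number of documents containing the term
    joint_docs = Counter()  # number of documents of a class containing the term
    for tokens, label in docs:
        for term in set(tokens):
            term_docs[term] += 1
            joint_docs[(term, label)] += 1
    return {
        (term, label): math.log(a * n / (term_docs[term] * class_docs[label]))
        for (term, label), a in joint_docs.items()
    }


def weighted_scores(docs):
    """Hypothetical variant (an assumption, not the paper's IMI formula):
    |MI| scaled by the log of the term's occurrence count within the class."""
    mi = mi_scores(docs)
    tf = Counter()
    for tokens, label in docs:
        for term in tokens:
            tf[(term, label)] += 1
    return {key: math.log(1 + tf[key]) * abs(value) for key, value in mi.items()}


def select_features(pair_scores, k):
    """Keep the k terms with the highest score over all classes."""
    best = {}
    for (term, _), score in pair_scores.items():
        best[term] = max(score, best.get(term, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:k]
```

With plain MI, a term that occurs in a single document of one class receives the largest possible score for that class, which is why low-frequency words tend to be over-selected; the frequency weight in the sketch is one way to counteract that.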
Authors
康岚兰
董丹丹
KANG Lan-lan, DONG Dan-dan (Faculty of Applied Science, Jiangxi University of Science and Technology, Ganzhou 341000, China)
Source
Computer Knowledge and Technology (《电脑知识与技术》)
2009, Issue 12Z, pp. 9889-9890 (2 pages)
Keywords
automatic categorization of Chinese text
feature selection
mutual information