Abstract
Feature selection (FS) can effectively improve both the efficiency and the accuracy of classification. Traditional FS approaches usually score individual features and rarely evaluate feature subsets. Building on a study of feature relevance, features are further divided into four categories: strongly relevant, weakly relevant, irrelevant, and redundant. The paper establishes the connection between Markov Blanket (MB) theory and feature relevance and, combining it with the Chi-Square statistical test, proposes a forward-selection algorithm based on an approximate Markov Blanket that yields a near-optimal feature subset. Experimental results show that the selected subset, though far smaller than the original feature set, achieves classification performance better than or comparable to that of the full feature set. Furthermore, in high-dimensional feature spaces such as text categorization, the method is compared with other feature selection approaches (OCFS, DF, CHI, IG); classification experiments on the 20 Newsgroups dataset show that the feature subsets produced by the proposed method outperform those of the other methods.
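The abstract outlines the idea of ranking features by their Chi-Square association with the class and pruning those subsumed by an approximate Markov blanket. The sketch below illustrates that idea only; it is not the authors' exact algorithm, and the specific redundancy criterion (an already-selected feature being at least as strongly associated with a candidate as the candidate is with the class) is an assumption made for illustration:

```python
from collections import Counter

def chi_square(x, y):
    """Chi-square statistic between two equal-length discrete sequences."""
    n = len(x)
    cx, cy = Counter(x), Counter(y)
    cxy = Counter(zip(x, y))
    stat = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            observed = cxy.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def forward_mb_select(features, labels):
    """Forward selection with an approximate Markov-blanket redundancy test.

    Features are ranked by Chi-Square association with the class. A candidate
    is dropped as irrelevant when its class score is zero, and as redundant
    when an already-selected (higher-ranked) feature is at least as strongly
    associated with the candidate as the candidate is with the class,
    i.e. the selected feature approximately subsumes it.
    """
    scores = {f: chi_square(v, labels) for f, v in features.items()}
    selected = []
    for f in sorted(scores, key=scores.get, reverse=True):
        if scores[f] == 0:                       # irrelevant feature
            continue
        if any(chi_square(features[g], features[f]) >= scores[f]
               for g in selected):               # approximately subsumed
            continue
        selected.append(f)
    return selected

# Toy data: f_strong determines the class, f_dup duplicates it, f_noise is independent.
labels   = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_strong": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_dup":    [0, 0, 0, 0, 1, 1, 1, 1],
    "f_noise":  [0, 1, 0, 1, 0, 1, 0, 1],
}
print(forward_mb_select(features, labels))  # ['f_strong']
```

On this toy data the duplicate feature is removed as redundant and the independent feature as irrelevant, matching the four-way categorization described in the abstract.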
Source
Chinese Journal of Computers (《计算机学报》), 2007, No. 12, pp. 2074-2081 (8 pages)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
Funding
National Science Fund for Distinguished Young Scholars of China (60425206)
National Natural Science Foundation of China (60503020)
Natural Science Research Program of Jiangsu Higher Education Institutions (04kjb520096)