Abstract
Feature selection is one of the methods for data reduction, and it has an important influence on the efficiency and effectiveness of machine learning. According to the distribution of objects in the feature space, the continuous feature space is partitioned into multiple subspaces, each with a single class label and a clear boundary. Each subspace is then projected onto every feature and, on the basis of statistical significance, the ability of each feature to discriminate the current subspace from all subspaces of different classes is evaluated. These evaluations are assembled into a discriminating-ability matrix, from which the features are ranked by classifying power. An information gain of the discriminating ability of a feature set is then introduced and, combined with the ranking, features are selected one by one until the feature subset is determined. Experiments on data sets from the UCI (University of California Irvine) repository yielded feature subsets that improved the efficiency and classification accuracy of machine learning, showing that the feature selection method is feasible.
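The abstract outlines a rank-then-add pattern: score each feature per pair of classes, collect the scores in a matrix, rank the features, then add them one at a time while a gain criterion improves. The sketch below only illustrates that pattern and is not the paper's algorithm. The abstract does not define the statistical evaluation or the information gain function, so both are replaced by stand-ins chosen here as assumptions: a mean-difference-over-spread score, and leave-one-out 1-nearest-neighbour accuracy as the gain. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def separation_matrix(X, y):
    """Rows are pairs of classes, columns are features.

    Each entry is |mean_a - mean_b| / (std_a + std_b), an assumed stand-in
    for the paper's statistical discriminating-ability evaluation.
    """
    classes = sorted(set(y))
    matrix = []
    for a, b in combinations(classes, 2):
        row = []
        for j in range(len(X[0])):
            col_a = [x[j] for x, label in zip(X, y) if label == a]
            col_b = [x[j] for x, label in zip(X, y) if label == b]
            spread = pstdev(col_a) + pstdev(col_b) + 1e-12  # avoid division by zero
            row.append(abs(mean(col_a) - mean(col_b)) / spread)
        matrix.append(row)
    return matrix


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest total score."""
    matrix = separation_matrix(X, y)
    totals = [sum(row[j] for row in matrix) for j in range(len(X[0]))]
    return sorted(range(len(totals)), key=lambda j: -totals[j])


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out 1-NN accuracy on the given features (assumed gain proxy)."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in features),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


def select_features(X, y):
    """Walk the ranking and keep a feature only if it improves the gain proxy."""
    selected, best_score = [], 0.0
    for j in rank_features(X, y):
        score = loo_1nn_accuracy(X, y, selected + [j])
        if score > best_score:
            selected, best_score = selected + [j], score
    return selected, best_score


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0], [1.0, 1.2], [1.1, 4.8], [1.2, 3.9]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
selected, score = select_features(X, y)
print(ranking, selected, score)
```

On the toy data the noise feature is ranked last and is not added, because it cannot improve on the score already reached with feature 0 alone. Swapping in a different scoring or gain function only requires replacing `separation_matrix` or `loo_1nn_accuracy`.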
Source
Journal of Shandong University (Engineering Science)
CAS
PKU Core (Peking University core journals list)
2011, Issue 6, pp. 1-6, 17 (7 pages in total)
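Funding
National High Technology Research and Development Program of China (863 Program) (2009AA062802)
National Natural Science Foundation of China (60473125)
CNPC Petroleum Science and Technology Innovation Fund for Young and Middle-aged Researchers (05E7013)
Sub-project of a National Major Special Project (G5800-08-ZS-WX)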
Keywords
data reduction
feature selection
continuous attributes
decision table