Abstract
A feature selection method for high-dimensional data based on a fuzzy clustering algorithm is proposed. First, features unrelated to the sample class are filtered out using the Bhattacharyya distance. Then, within a recursive feature elimination process, a clustering method based on the fuzzy iterative self-organizing data analysis technique (ISODATA) is proposed: the weighted distance between each sample and the cluster centers serves as the separability index, and candidate feature subsets are generated. Finally, with the area under the receiver operating characteristic curve (AUC) and the accuracy of each candidate subset in classification and clustering as the objective function, the optimal feature subset is determined. The method is applied to select feature genes from five gene expression profile datasets. The results show that the selected features perform well in both classification and clustering, which demonstrates that the proposed method is effective for selecting informative features from high-dimensional datasets.
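The first two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two classes, a univariate Gaussian model per feature for the Bhattacharyya distance, fuzzy c-means style memberships as a stand-in for fuzzy ISODATA, class means as cluster centers, and a hypothetical filter threshold of 0.1. All function names and the synthetic data are illustrative. The abstract does not state how the weighted distance is turned into an elimination criterion, so the sketch only computes each feature's share of it.

```python
import numpy as np

def bhattacharyya_per_feature(X, y, eps=1e-12):
    """Bhattacharyya distance between class 0 and class 1 for every feature,
    assuming each feature is univariate Gaussian within a class."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0) + eps, X1.var(axis=0) + eps
    mean_term = 0.25 * (m0 - m1) ** 2 / (v0 + v1)
    var_term = 0.5 * np.log((v0 + v1) / (2.0 * np.sqrt(v0 * v1)))
    return mean_term + var_term

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Fuzzy c-means style membership u[k, i] of sample k in cluster i."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def weighted_distance_per_feature(X, centers, m=2.0):
    """Share of each feature j in the membership-weighted squared distance
    sum_k sum_i u[k, i]**m * (X[k, j] - centers[i, j])**2."""
    u = fuzzy_memberships(X, centers, m)
    sq = (X[:, None, :] - centers[None, :, :]) ** 2  # shape (n, c, d)
    return ((u ** m)[:, :, None] * sq).sum(axis=(0, 1))

# Synthetic data (illustrative only): feature 0 separates the classes,
# the remaining four features are pure noise.
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 5))
X[y == 1, 0] += 3.0

# Step 1: filter features whose class distance is below the threshold.
scores = bhattacharyya_per_feature(X, y)
keep = np.flatnonzero(scores > 0.1)  # 0.1 is a hypothetical threshold

# Step 2 (ingredient only): per-feature weighted distance to the centers.
centers = np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])
u = fuzzy_memberships(X, centers)
wd = weighted_distance_per_feature(X, centers)

print("kept features:", keep)
print("per-feature weighted distance:", wd)
```

In a recursive elimination loop, a score such as `wd` would be recomputed after each feature is removed, and each resulting candidate subset would then be compared by AUC and accuracy, as the abstract describes.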
Source
《南京航空航天大学学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2012, No. 6, pp. 881-887 (7 pages)
Journal of Nanjing University of Aeronautics & Astronautics
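Funding
National Natural Science Foundation of China (10172043, 61173068)
Doctoral Program Foundation of the Ministry of Education of China (20093218110024)
Jiangsu Province International Cooperation Project (BZ2010060)
Key Project of the Jiangsu Provincial Bureau of Technical Supervision (KJ122714)
Key Natural Science Research Project of the Anhui Provincial Department of Education (KJ2010A226)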
Keywords
feature selection
fuzzy iterative self-organizing data analysis technique (ISODATA)
hierarchical clustering
support vector machine (SVM)
K-nearest neighbor