Abstract
Gene sets selected from microarray data by common ranking methods tend to contain many highly correlated genes, which degrades classifier performance. To remove these redundant genes (features), an unsupervised feature selection algorithm is proposed. The algorithm has two steps: partitioning the original feature set into homogeneous subsets (clusters), and selecting one representative feature from each cluster. Features are partitioned according to the k-nearest-neighbor (k-NN) principle, using pairwise feature correlation as the similarity measure. The method does not require the number of clusters to be specified in advance and has low time complexity. Experiments on real biological data show that the algorithm significantly improves the classification accuracy of existing classifiers.
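The abstract gives only the outline of the method (group features by k-NN over pairwise correlation, then keep one representative per group), so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the dissimilarity between two features is 1 minus the absolute Pearson correlation, that the representative is the feature whose k-th nearest neighbor is closest, and that neighbors are discarded only when their dissimilarity is at most a cutoff. The function name `select_representatives` and the parameters `k` and `threshold` are hypothetical.

```python
import numpy as np


def select_representatives(X, k, threshold=0.2):
    """Greedy k-NN feature selection sketch (not the paper's exact algorithm).

    X         : array of shape (n_samples, n_features), no constant columns.
    k         : neighborhood size (hypothetical parameter).
    threshold : assumed cutoff on 1 - |Pearson r| below which a neighbor
                counts as redundant (hypothetical parameter).
    Returns the indices of the retained features.
    """
    # Assumed dissimilarity: 1 - |Pearson correlation| between feature columns.
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(dist, np.inf)  # a feature is never its own neighbor

    remaining = set(range(X.shape[1]))
    selected = []
    while remaining:
        idx = np.array(sorted(remaining))
        if len(idx) == 1:
            selected.append(int(idx[0]))
            break
        kk = min(k, len(idx) - 1)
        sub = dist[np.ix_(idx, idx)]
        order = np.argsort(sub, axis=1)  # neighbors sorted by dissimilarity
        # Dissimilarity to the kk-th nearest neighbor of each remaining feature.
        kth = sub[np.arange(len(idx)), order[:, kk - 1]]
        best = int(np.argmin(kth))  # most compact neighborhood
        selected.append(int(idx[best]))
        # Discard the representative's close neighbors as redundant.
        neighbors = order[best, :kk]
        redundant = idx[neighbors[sub[best, neighbors] <= threshold]]
        remaining.discard(int(idx[best]))
        remaining -= set(redundant.tolist())
    return selected
```

No cluster count is passed in; the number of retained features falls out of `k` and the cutoff. Calling `select_representatives(X, k=2)` on a matrix with three groups of near-duplicate columns would keep one column per group.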
Source
Journal of Zhejiang University: Engineering Science
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2004, No. 10, pp. 1289-1292 (4 pages)
Keywords
Microarray
Gene selection
Correlation analysis
Unsupervised learning
Biocommunications
Correlation methods
DNA sequences
Genes
Learning algorithms
Mathematical models