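Abstract
The key to building an effective tumor classification model from gene expression profiles is to accurately identify the set of feature genes that determines the sample class. Rough set theory, a relatively new soft computing method, can greatly reduce the attribute set while preserving the classification ability of the original dataset, and can therefore find the genes that matter for classification among a large number of genes. Because gene expression data are continuous, and to avoid the information loss caused by the discretization that rough set methods require, we apply fuzzy rough sets to feature gene selection and propose a mutual-information-based fuzzy rough set attribute reduction algorithm for selecting genes from gene expression datasets. KNN and C5.0 classifiers are then used to test the classification performance of the selected feature genes. Experiments on feature gene selection for acute leukemia subtypes (leukemia Microarray) and colorectal cancer (colon Microarray) show that the method is feasible and effective.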
Feature selection is an essential step in cancer classification with DNA microarrays, because classes must be predicted from a large number of genes and a relatively small number of samples. Rough set theory is a tool for reducing redundancy in information systems, so applying rough sets successfully to gene selection is of great significance. Fuzzy rough sets are introduced to avoid the information loss caused by discretizing continuous gene expression data, a step that classical rough set theory requires. A novel gene selection method, IMIBAFRAR, is proposed, which reduces the computation of mutual information. KNN and C5.0 are then applied to validate the classification performance of the selected genes in distinguishing different tissue types. The method is applied to two public gene expression datasets: leukemia and colon. Experimental results show that the selected genes reflect the classification ability of the original genes. Compared with the unreduced genes and with the genes selected by the classical rough set method, our method leads to significantly improved recognition accuracy. Meanwhile, computational complexity is reduced.
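The abstract does not spell out the reduction algorithm, so the following is only a minimal sketch of how a mutual-information-based fuzzy rough set gene selection might look, not the authors' IMIBAFRAR procedure. It assumes a range-normalized fuzzy similarity relation per gene, the fuzzy entropy H(R) = -(1/n) Σ log2(|[x_i]_R| / n), intersection of relations by element-wise minimum, and a greedy forward search. All function names, the `max_genes` limit, and the `eps` stopping threshold are hypothetical.

```python
import math


def fuzzy_relation(values):
    """Fuzzy similarity matrix for one gene (assumed form):
    r_ij = 1 - |x_i - x_j| / (max - min)."""
    span = max(values) - min(values)
    if span == 0:
        return [[1.0] * len(values) for _ in values]
    return [[1.0 - abs(a - b) / span for b in values] for a in values]


def crisp_relation(labels):
    """Equivalence relation induced by the class labels (1 if same class)."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def intersect(rel_a, rel_b):
    """Relation induced by two attribute sets together: element-wise minimum."""
    return [[min(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(rel_a, rel_b)]


def fuzzy_entropy(rel):
    """H(R) = -(1/n) * sum_i log2(|[x_i]_R| / n), with |[x_i]_R| the row sum."""
    n = len(rel)
    return -sum(math.log2(sum(row) / n) for row in rel) / n


def mutual_information(rel_b, rel_d):
    """I(B; D) = H(B) + H(D) - H(B, D)."""
    joint = intersect(rel_b, rel_d)
    return fuzzy_entropy(rel_b) + fuzzy_entropy(rel_d) - fuzzy_entropy(joint)


def select_genes(columns, labels, max_genes=5, eps=1e-6):
    """Greedy forward selection: repeatedly add the gene that most increases
    the mutual information with the class labels; stop when the gain <= eps."""
    n = len(labels)
    rel_d = crisp_relation(labels)
    rels = [fuzzy_relation(col) for col in columns]
    selected = []
    rel_b = [[1.0] * n for _ in range(n)]  # empty gene set: all samples alike
    best = 0.0
    while len(selected) < max_genes:
        scored = [(mutual_information(intersect(rel_b, rels[j]), rel_d), j)
                  for j in range(len(rels)) if j not in selected]
        if not scored:
            break
        score, j = max(scored)
        if score - best <= eps:
            break
        selected.append(j)
        rel_b = intersect(rel_b, rels[j])
        best = score
    return selected


# Toy data (invented for illustration): gene 0 separates the two classes,
# gene 1 is unrelated to them.
genes = [
    [0.0, 0.1, 0.05, 1.0, 0.9, 0.95],
    [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],
]
classes = [0, 0, 0, 1, 1, 1]
chosen = select_genes(genes, classes)
```

Because the similarity relation is computed directly on the continuous expression values, no discretization step is needed, which is the motivation the abstract gives for using fuzzy rough sets. The selected gene indices could then be passed to a classifier such as KNN for validation.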
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2009, No. 3, pp. 196-200 (5 pages)
Computer Science
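Funding
National Natural Science Foundation of China (60475019)
Key Program of the National Natural Science Foundation of China (60534060)
National Key Basic Research Program of China (973 Program) (2003CB316902)
Supported by the 2006 Specialized Research Fund for the Doctoral Program of Higher Education (20060247039)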
Keywords
Gene expression data, Feature selection, Rough sets, Fuzzy rough sets, Mutual information