Abstract
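In gene expression data the number of genes (features) far exceeds the number of conditions (samples), the data usually contain substantial noise, and biological and medical researchers hope to pick out, from a large number of genes, the marker genes relevant to disease diagnosis. Gene selection is therefore the key step in applying gene expression data to disease classification and prediction. The methods in common use today are filter methods and wrapper methods. Combining the advantages of both, this paper proposes a multi-objective estimation of distribution algorithm (MOEDA) for gene selection. A scoring function first determines the candidate gene set for MOEDA. Once the candidates are fixed, MOEDA optimizes several objectives, including multiple performance measures of a KNN classifier and the number of genes, to select from the candidates the feature gene subset with the strongest overall discriminative power. Experiments on the childhood small round blue cell tumor dataset (SRBCT) show that, without any complex parameter setting, the method selects only 7 of 2000 genes and the classifier reaches 95% classification accuracy on the independent test set.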
The number of genes is usually much larger than the number of patient samples. Meanwhile, owing to systematic error, technical limitations and similar factors, gene expression data contain considerable noise. Moreover, biologists want to find a small group of biomarker genes in the raw dataset that could help reveal the relationship between genes and cancers. It is therefore necessary to select optimal genes from the raw dataset for the prognosis and diagnosis of cancers. This paper integrates the two common gene selection strategies, filter methods and wrapper methods, and proposes MOEDA to select the final optimal genes. First, a filtering step retains the genes with high evaluation scores. Taking accuracy, sensitivity and scale (the number of genes) into account, MOEDA then optimizes these objectives for a KNN classifier and produces the final optimal genes. Without any complex parameter setting, the experiment on the SRBCT dataset achieves 95% accuracy on the independent test set with 7 genes selected from 2,000.
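The two-stage idea in the abstract (filter first, then a multi-objective search over the candidates) can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring function (a between-class to within-class variance ratio), the univariate probability model, the population settings, the function names and the synthetic data are all assumptions made for illustration. Only two objectives are used here, leave-one-out KNN error and gene count, whereas the paper also considers sensitivity.

```python
import numpy as np

def filter_scores(X, y):
    """Per-gene between-class / within-class variance ratio (assumed scoring function)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2 for c in classes)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
    return between / (within + 1e-12)

def knn_loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour classifier."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)              # a sample may not vote for itself
    nearest = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(y[row]).argmax() for row in nearest])
    return float((pred == y).mean())

def evaluate(mask, X, y):
    """Objectives to minimise: (classification error, number of selected genes)."""
    if not mask.any():
        return (1.0, int(mask.size))         # penalise the empty subset
    return (1.0 - knn_loo_accuracy(X[:, mask], y), int(mask.sum()))

def nondominated(objs):
    """Indices of Pareto-optimal points when every objective is minimised."""
    keep = []
    for i, a in enumerate(objs):
        dominated = any(all(bj <= aj for bj, aj in zip(b, a)) and b != a for b in objs)
        if not dominated:
            keep.append(i)
    return keep

def moeda_sketch(X, y, n_candidates=20, pop=40, gens=15, seed=0):
    rng = np.random.default_rng(seed)
    # Filter stage: keep the highest-scoring genes as candidates.
    cand = np.argsort(filter_scores(X, y))[::-1][:n_candidates]
    Xc = X[:, cand]
    p = np.full(n_candidates, 0.2)           # marginal inclusion probabilities
    archive = {}                             # gene mask -> objective tuple
    for _ in range(gens):
        # Sample a population of gene subsets from the current distribution.
        masks = rng.random((pop, n_candidates)) < p
        for m in masks:
            archive[tuple(m)] = evaluate(m, Xc, y)
        # Keep only the non-dominated subsets found so far.
        keys = list(archive)
        vals = [archive[key] for key in keys]
        archive = {keys[i]: vals[i] for i in nondominated(vals)}
        # Re-estimate the distribution from the non-dominated subsets.
        elite = np.array(list(archive), dtype=float)
        p = np.clip(0.5 * p + 0.5 * elite.mean(axis=0), 0.05, 0.95)
    return [(cand[np.array(key)], obj) for key, obj in archive.items()]

# Synthetic data (an assumption, not SRBCT): 40 samples, 200 genes,
# of which genes 0-2 carry the class signal.
data_rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = data_rng.normal(size=(40, 200))
X[:, :3] += 4.0 * y[:, None]

front = moeda_sketch(X, y)
best_error, best_size = min(obj for _, obj in front)
print(f"best trade-off: error={best_error:.3f} with {best_size} gene(s)")
```

The function returns a Pareto front rather than a single subset, so a final choice (for example the smallest subset below an acceptable error) is left to the user. A faithful reproduction would need the paper's own scoring function, probability model and objective set.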
Source
《计算机应用研究》
CSCD
PKU Core (Peking University core journal list)
2009, No. 8, pp. 2891-2894 (4 pages)
Application Research of Computers
Funding
Supported by the National Natural Science Foundation of China (60773010)
Keywords
classification prediction
gene selection
multi-objective evolution
multi-objective estimation of distribution algorithm (MOEDA)