Abstract
To address the problem of gene selection in gene expression datasets, this paper proposes a gene subset selection algorithm based on the Kolmogorov-Smirnov (K-S) test and the minimum redundancy-maximum relevance (mRMR) principle. The algorithm first uses the K-S test to select genes with a certain discriminative ability. It then applies the mRMR criterion to those genes, retaining the ones that are highly correlated with the class label but only weakly correlated with one another; these form the final gene subset. With SVM as the classifier, the selected gene subsets are evaluated by F1_measure, classification accuracy and AUC, and the proposed algorithm is compared with the K-S test, mRMR, and the classic RELIEF and FAST algorithms. Average experimental results on five classic gene expression datasets show that the proposed algorithm runs much faster than mRMR and outperforms the compared algorithms on all of the evaluation criteria. The proposed combination of the K-S test and mRMR can therefore select highly effective gene subsets.
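The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `d_min` threshold on the K-S statistic, and the toy data are all hypothetical, and absolute Pearson correlation is swapped in as a simple stand-in for the relevance and redundancy measures (mRMR is usually formulated with mutual information). The abstract does not say how the K-S cut-off or the subset size is chosen.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def abs_pearson(u, v):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = sqrt(sum((p - mu) ** 2 for p in u))
    sv = sqrt(sum((q - mv) ** 2 for q in v))
    return abs(cov / (su * sv)) if su > 0 and sv > 0 else 0.0


def ks_mrmr_select(genes, labels, d_min=0.5, k=2):
    """genes: dict of gene name -> expression values; labels: list of 0/1.

    d_min and k are hypothetical parameters chosen for this sketch.
    """
    # Stage 1: K-S filter. Keep genes whose expression distributions
    # differ between the two classes by at least d_min.
    candidates = []
    for name, values in genes.items():
        pos = [v for v, c in zip(values, labels) if c == 1]
        neg = [v for v, c in zip(values, labels) if c == 0]
        if ks_statistic(pos, neg) >= d_min:
            candidates.append(name)

    # Stage 2: greedy mRMR-style selection on the survivors.
    # Score = relevance to the label minus mean redundancy with the
    # genes already selected (assumption: |Pearson| replaces mutual information).
    relevance = {g: abs_pearson(genes[g], labels) for g in candidates}
    selected = []
    while candidates and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = sum(
                abs_pearson(genes[g], genes[s]) for s in selected
            ) / len(selected)
            return relevance[g] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): g2 is a shifted copy of g1, g4 does not separate the classes.
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g1": [1, 2, 3, 7, 8, 9],
    "g2": [11, 12, 13, 17, 18, 19],
    "g3": [3, 2, 1, 9, 8, 7],
    "g4": [5, 1, 9, 4, 8, 2],
}
selected = ks_mrmr_select(genes, labels, d_min=0.5, k=2)
print(selected)
```

In this toy run the K-S stage discards `g4`, and the mRMR-style stage avoids picking `g2` alongside `g1` because the two are perfectly correlated. Running the cheap K-S filter first means the pairwise redundancy computation only touches the surviving genes, which is consistent with the abstract's report of a much shorter running time than plain mRMR.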
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journals (北大核心)
2016, No. 4, pp. 1013-1018, 1043 (7 pages)
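Funding
Shaanxi Provincial Science and Technology Research Project (2013K12-03-24)
National Natural Science Foundation of China (31372250)
Fundamental Research Funds for the Central Universities (GK201503067)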