Abstract
Cancer diagnosis is an important topic in bioinformatics, and selecting a small subset of cancer-related genes from the thousands measured in gene expression (microarray) data, known as gene selection, is key to accurate diagnosis. Random forests, a popular algorithm in recent years, can measure the importance of features in classification (this measure is referred to as PBM). Motivated by this, we propose two novel gene selection methods based on tree structures, FBM and ABM. As the indicator of gene importance they use, respectively, the frequency with which a gene appears in the trees and the average of its importance scores, both computed over a large number of decision trees built on the microarray datasets. In the computational experiments, gene subsets are selected with the three methods, random forest classifiers are built on those subsets, and the AUC is used to evaluate the quality of the gene selection. The results show that PBM needs at least 26 genes on the Leukemia dataset and at least 48 genes on the Colon Cancer dataset to reach an AUC of no less than 0.900, whereas with only the top 10 genes selected, FBM and ABM both achieve an AUC of 0.989 on the Leukemia dataset and 0.900 on the Colon Cancer dataset. In addition, the proposed methods achieve higher accuracy than other typical gene selection methods such as mRMR and ECRP, which is of practical significance for the accurate diagnosis and early treatment of cancer.
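The two ranking ideas named in the abstract (gene frequency across trees and average importance score across trees) can be illustrated with a minimal sketch. The abstract does not specify how per-tree scores are computed or normalized, so everything below is an assumption for illustration: the function name `rank_genes`, the representation of each tree as a dictionary from gene name to score, the choice to average over all trees (counting an absent gene as 0), and the toy scores are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict

def rank_genes(trees, top_k=10):
    """Rank genes by frequency and by average score across trees.

    `trees` is a hypothetical representation: one dict per decision tree,
    mapping each gene used in that tree to an importance score.
    """
    freq = defaultdict(int)     # number of trees in which a gene appears
    total = defaultdict(float)  # sum of a gene's scores over all trees
    for tree in trees:
        for gene, score in tree.items():
            freq[gene] += 1
            total[gene] += score

    n_trees = len(trees)
    # Frequency-style ranking: most frequently used genes first
    # (ties broken by gene name so the result is deterministic).
    by_freq = sorted(freq, key=lambda g: (-freq[g], g))[:top_k]

    # Average-style ranking: mean score over ALL trees (assumption:
    # a gene absent from a tree contributes a score of 0 for that tree).
    avg = {g: total[g] / n_trees for g in total}
    by_avg = sorted(avg, key=lambda g: (-avg[g], g))[:top_k]
    return by_freq, by_avg

# Toy input: three trees with made-up scores.
trees = [
    {"g1": 0.6, "g2": 0.4},
    {"g1": 0.5, "g3": 0.5},
    {"g2": 0.9, "g4": 0.1},
]
top_by_freq, top_by_avg = rank_genes(trees, top_k=2)
print(top_by_freq, top_by_avg)
```

In this toy input the two rankings disagree on the top gene: `g1` and `g2` each appear in two trees, but `g2` has the larger mean score. A frequency-based and an average-based criterion can therefore select different subsets from the same forest.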
Source
《计算机科学》 (Computer Science)
CSCD
Peking University Core Journals (北大核心)
2015, Issue 7, pp. 250-253 (4 pages)
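Funding
Supported by:
National Natural Science Foundation of China (61271337, 61103126)
Specialized Research Fund for the Doctoral Program of the Ministry of Education (20100141120049)
Natural Science Foundation of Hubei Province (2011CDB454)
Shenzhen Strategic Emerging Industries Development Special Fund (JCYJ20130401160028781)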
Keywords
Classification
Gene selection
Random forests