Abstract
The random forest algorithm can rank features by importance and can improve both computational efficiency and classification accuracy. In this study, analysis of variance (ANOVA) and the random forest algorithm were used to screen breast cancer genes. With the selected genes, the random forest, support vector machine (SVM), and k-nearest neighbor (KNN) algorithms reached test-set accuracies of 95.6%, 92.9%, and 92.7%, respectively. The two genes found to be most important for distinguishing breast cancer subtypes were GATA3 and ESR1.
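The ANOVA screening step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so the following assumes a plain one-way ANOVA F statistic computed per gene across subtype labels, with genes then ranked by F. The function names `anova_f` and `rank_genes`, the gene names, and the toy expression values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("need at least two classes")
    grand = mean(values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels):
    """Return gene names sorted by descending F statistic."""
    scores = {gene: anova_f(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data (not from the paper): 6 samples, 2 subtypes.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_x": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # differs between subtypes
    "gene_y": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # similar in both subtypes
}
ranking = rank_genes(expression, labels)
```

In a pipeline like the one described, the top-ranked genes from such a filter would presumably be passed on to the random forest for importance ranking and then to the classifiers; that ordering is an assumption here, not something the abstract states.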
Authors
YANG Shao-hua (杨绍华)
CHEN Dong-dong (陈冬东)
ZHANG Xu (张旭)
HE Lin (何林)
School of Mathematics and Statistics, Southwest University, Chongqing 400715, China; Institute of Botany, Chinese Academy of Sciences, Beijing 100049, China
Source
Journal of Southwest University (Natural Science Edition)
Indexed in: CAS, CSCD, PKU Core
2018, Issue 5, pp. 113-116 (4 pages)
Funding
National Natural Science Foundation of China (11701471)
Chongqing Basic Science and Frontier Technology Research Project (cstc2017jcyjAX0476)
Keywords
data mining
microarray
breast cancer
classification