Abstract
Objective To apply a naive Bayesian classifier (NBC) combined with maximum relevance minimum redundancy (MRMR) feature selection to gene expression data, and to compare it with the classic NBC and random forests (RF). Methods The three methods were implemented in Matlab and R and compared on colon cancer and lung cancer gene expression datasets; 10-fold cross-validation was used to estimate the classification accuracy of the classic NBC and RF. Results On the colon cancer dataset, MRMR-NBC reached a classification accuracy of 93.55% with m = 11 features under the mutual information quotient (MIQ) criterion, and 95.16% with m = 15 features under the mutual information difference (MID) criterion. On the lung cancer dataset, MRMR-NBC reached its highest accuracy of 98.63% with m = 14 features under MIQ, and 97.26% with m = 12 features under MID. The classic NBC achieved accuracies of 66.67% and 80.00% on the colon cancer and lung cancer datasets, respectively; RF achieved 81.89% and 77.62%. Conclusion MRMR-NBC achieves high classification accuracy with only a very small number of features and outperforms the classic NBC and RF.
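The abstract names the MID and MIQ criteria without stating them. The sketch below shows one common way to write greedy MRMR selection: relevance is the mutual information between a feature and the class label, redundancy is the mean mutual information with the features already selected, and the two are combined by difference (MID) or by quotient (MIQ). It is an illustration only, not the authors' Matlab/R code. It assumes the expression values have already been discretized, and the names `mutual_information` and `mrmr_select` are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    # Joint frequency table, normalized to a joint probability table.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def mrmr_select(X, y, m, scheme="MID"):
    """Greedily pick up to m feature indices from discretized X (samples x features).

    Assumption: scheme "MID" scores relevance - redundancy,
    scheme "MIQ" scores relevance / redundancy.
    """
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]  # start from the most relevant feature
    redundancy_sum = np.zeros(n_features)
    while len(selected) < min(m, n_features):
        last = selected[-1]
        # Accumulate each candidate's mutual information with the newest pick.
        redundancy_sum += np.array(
            [mutual_information(X[:, j], X[:, last]) for j in range(n_features)]
        )
        redundancy = redundancy_sum / len(selected)
        if scheme == "MID":
            score = relevance - redundancy
        else:  # "MIQ"; the small constant avoids division by zero
            score = relevance / (redundancy + 1e-12)
        score[selected] = -np.inf           # never pick a feature twice
        selected.append(int(np.argmax(score)))
    return selected
```

Under these assumptions, the indices returned by `mrmr_select(X, y, m)` would be used to subset the columns of `X` before fitting the naive Bayesian classifier, so that only `m` genes take part in classification.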
Source
Chinese Journal of Health Statistics (《中国卫生统计》)
CSCD
Peking University Core Journals (北大核心)
2015, No. 6, pp. 932-934 (3 pages)
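Funding
National Natural Science Foundation of China (81373103)
Basic and Frontier Research Program of the Chongqing Science and Technology Commission (cstc2013jcyjA10009)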
Keywords
Maximum relevance minimum redundancy
Naive Bayesian classifier
Random forests
Feature selection