The Application of a Random Forest-based Variable Hunting Method to Variable Selection in High-dimensional Data (cited by: 17)
Abstract Objective: To explore the application of random forest (RF)-based variable hunting methods to variable selection in high-dimensional data. Methods: Using simulation experiments and an analysis of real metabonomics data, two variable hunting methods (vh.md, vh.vimp) were compared with backward variable elimination using random forest (varSelRF). The methods were evaluated by the number of selected variables, the prediction error rate (PE), and the area under the receiver operating characteristic curve (AUC). Results: The simulation experiments indicated that, when variables had combined effects, interaction effects, or weak independent effects, the variable hunting methods clearly outperformed both varSelRF and ranking all variables by VIMP. The analysis of the metabonomics data showed that the variable hunting methods gave stable selection results while maintaining good predictive performance. Conclusion: The variable hunting approach is suitable for variable selection in high-dimensional data and has practical value.
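The abstract names PE and AUC as evaluation criteria but does not state how they were computed. As a minimal sketch, assuming binary class labels coded 0/1 and a continuous predicted score per sample, the two quantities can be written as follows. The function names and the toy data are hypothetical and are not taken from the paper; the AUC uses the standard pairwise (Mann-Whitney) definition.

```python
def prediction_error(y_true, y_pred):
    """Prediction error rate: fraction of samples whose predicted class is wrong."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true classes and predicted scores for six samples.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
# Classify as positive when the score reaches 0.5 (an assumed cut-off).
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("PE :", prediction_error(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

PE depends on the chosen cut-off, whereas AUC depends only on how the scores rank the samples, which is one reason the two are usually reported together.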
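Source Chinese Journal of Health Statistics (《中国卫生统计》), indexed in CSCD and the Peking University Core Journals list, 2015, Issue 1, pp. 49-53 (5 pages)
Funding National Natural Science Foundation of China (81172767); Specialized Research Fund for the Doctoral Program of Higher Education (20122307110004)
Keywords Random forest; Variable selection; Variable hunting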