Abstract
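Objective: To explore the application of random forest (RF) variable hunting methods to variable selection in high-dimensional data. Methods: Through simulation experiments and analysis of real data, two variable hunting methods (vh.md, vh.vimp) were compared with a stepwise elimination method (varSelRF), and were evaluated by the number of selected variables, the model prediction error rate (PE), and the area under the receiver operating characteristic curve (AUC). Results: The simulation experiments showed that, when variables had joint effects, interaction effects, or weak independent effects, the variable hunting methods were all clearly superior to varSelRF and to ranking all variables by VIMP; the real-data analysis showed that the variable hunting methods gave stable variable selection results and ensured good predictive performance. Conclusion: Variable hunting methods are suitable for variable selection in high-dimensional data and have practical value.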
Objective: To explore the application of a random forest-based variable hunting approach to variable selection in high-dimensional data. Methods: Two variable hunting methods (vh.md, vh.vimp) were compared with backward variable elimination using random forest (varSelRF) through the analysis of simulated data and real metabonomics data; the number of selected variables, the prediction error rate (PE), and the area under the receiver operating characteristic curve (AUC) were used to evaluate these approaches. Results: Simulation experiments suggested that the variable hunting methods were more effective than varSelRF and the sorted VIMP method in the presence of combined effects, interactions, and weak independent effects. Analysis of the metabonomics data confirmed that variable selection with the variable hunting method was stable and gave favorable predictive performance. Conclusion: The variable hunting approach is applicable to variable selection in high-dimensional data and has practical value.
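The two performance criteria named in the abstract, PE and AUC, can be illustrated with a minimal sketch. This is not the authors' code: the function names `prediction_error_rate` and `auc`, the toy labels and scores, and the 0.5 classification threshold are all assumptions made for illustration, and the AUC is computed by the standard pairwise (Mann-Whitney) definition rather than by any particular package.

```python
from typing import Sequence


def prediction_error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true classes and predicted class-1 probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
# Assumed threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

pe = prediction_error_rate(y_true, y_pred)
area = auc(y_true, scores)
print(f"PE = {pe:.3f}, AUC = {area:.3f}")
```

A lower PE and a higher AUC both indicate better prediction; in a comparison like the one described, they would be computed on data not used to select the variables.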
Source
《中国卫生统计》
CSCD
PKU Core Journal (北大核心)
2015, Issue 1, pp. 49-53 (5 pages)
Chinese Journal of Health Statistics
Funding
National Natural Science Foundation of China (81172767)
Specialized Research Fund for the Doctoral Program of Higher Education (20122307110004)
Keywords
Random forest (随机森林)
Variable selection (变量筛选)
Variable hunting (变量捕获)