Abstract
Objective: To explore random forest (RF) analysis of high-dimensional data in the presence of confounding factors. Methods: Computer simulations and a real-data analysis were used to compare three approaches: plain RF analysis, RF with the number of candidate variables at each split set to its maximum (RFMCV), and RF with confounders corrected through the residuals of a generalized linear model (GLM). The approaches were evaluated by the importance-score ranks of the causal variables. Results: In the simulations, increasing the number of candidate variables at each split did little to correct for confounding, whereas the GLM-residual method corrected the confounding effects effectively. In the real-data analysis of genome-wide association study (GWAS) data, plain RF ranked rs3754686 and rs2322660 first and second, respectively. Under RFMCV the rank of rs3754686 changed little; after residual-based correction for population stratification, the ranks of both single nucleotide polymorphisms (SNPs) dropped substantially, which removed the spurious association between the lactase (LCT) gene and height. Conclusion: Confounding factors need to be accounted for in RF analysis. GLM-based residual correction adjusts for confounding effectively and is suitable for variable selection in high-dimensional data.
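The residual step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the simulated data, the single binary confounder (a subpopulation label standing in for population stratification), the use of ordinary least squares (a GLM with identity link), and the helper names `ols_residuals` and `pearson` are all assumptions made for illustration. The sketch only shows that a non-causal SNP whose allele frequency differs between subpopulations looks associated with the outcome until the outcome is replaced by its residuals; in an RF analysis, those residuals would then serve as the response.

```python
import math
import random


def ols_residuals(y, c):
    """Residuals of y after a simple linear regression on one confounder c."""
    n = len(y)
    mean_y = sum(y) / n
    mean_c = sum(c) / n
    sxx = sum((ci - mean_c) ** 2 for ci in c)
    sxy = sum((ci - mean_c) * (yi - mean_y) for ci, yi in zip(c, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_c
    return [yi - (intercept + slope * ci) for ci, yi in zip(c, y)]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is reproducible
n_subjects = 2000

# Hypothetical confounder: subpopulation label (0 or 1).
subpop = [rng.randint(0, 1) for _ in range(n_subjects)]

# Non-causal SNP: allele count 0/1/2, allele frequency depends on subpopulation.
snp = [
    sum(rng.random() < (0.8 if s == 1 else 0.2) for _ in range(2))
    for s in subpop
]

# Outcome depends on the subpopulation only, never on the SNP.
outcome = [2.0 * s + rng.gauss(0.0, 1.0) for s in subpop]

# Spurious association before correction.
raw_corr = pearson(snp, outcome)

# Residual-based correction: regress the outcome on the confounder.
adjusted_outcome = ols_residuals(outcome, subpop)
adj_corr = pearson(snp, adjusted_outcome)

print(f"correlation before correction: {raw_corr:.3f}")
print(f"correlation after correction:  {adj_corr:.3f}")
```

With several confounders (for example, principal components of ancestry) or a non-Gaussian outcome, the regression step would be replaced by a multivariable GLM fit; that extension is not shown here.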
Authors
尤东方
魏永越
张汝阳
陈峰
赵杨
You Dongfang; Wei Yongyue; Zhang Ruyang; Chen Feng; Zhao Yang (Department of Biostatistics, School of Public Health, Key Laboratory of Biomedical Big Data, NMU, Nanjing 211166, China)
Source
《南京医科大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2018, Issue 7, pp. 978-982 (5 pages)
Journal of Nanjing Medical University(Natural Sciences)
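Funding
National Key Research Project (2016YFE0204900)
National Natural Science Foundation of China (81373102, 81530088, 81473070, 81402764, 81402763)
Jiangsu Province Qing Lan Project (Academic Leader)
Jiangsu Province Priority Discipline in Preventive Medicine
Top-notch Academic Programs Project of Jiangsu Higher Education Institutions (PPZY2015A067)
Key Project of the Natural Science Foundation of Jiangsu Province (14JA31002)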
Keywords
random forest
confounding effect
residual
population stratification