Abstract
Objective To apply the random forest algorithm to high-dimensional DNA methylation data from a case-control study of rheumatoid arthritis (RA) and to evaluate its performance. Methods The example dataset was obtained from the Gene Expression Omnibus (GEO) data repository (accession number GSE42861) and contained 689 samples (354 patients and 335 controls). Chromosome 9, where the RA-associated gene region is located, was selected, and a total of 2,433 cytosine-phosphate-guanine dinucleotide (CpG) sites were included. First, variable importance scores were calculated with random forest and the variables were ranked by these scores. Second, a stepwise random forest procedure was run on the ranked variables to find the subset most likely to be associated with the outcome. Third, stepwise Logistic regression was performed on the reduced subset. Results Stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the Logistic regression model. The Logistic regression model built on these 13 sites had a prediction accuracy of 88.29%. Conclusions The random forest algorithm can greatly reduce noise variables, improve statistical power, and is suitable for the analysis of high-dimensional DNA methylation data.
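The three-step procedure in Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic (generated with scikit-learn's make_classification rather than taken from GSE42861), the function names rank_by_importance and stepwise_random_forest are hypothetical, and the elimination rule (drop the lowest-ranked 30% of variables at each step and keep the subset with the lowest out-of-bag error) is one assumed reading of "stepwise random forest". A plain logistic regression with cross-validated accuracy stands in for the stepwise Logistic regression step, so no variable-level significance testing is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rank_by_importance(X, y, n_trees=200, seed=0):
    """Return column indices sorted from most to least important."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1]


def stepwise_random_forest(X, y, ranking, drop_frac=0.3, min_vars=5,
                           n_trees=200, seed=0):
    """Drop the lowest-ranked variables step by step (assumed scheme).

    Returns the subset with the lowest out-of-bag (OOB) error and that error.
    Ties are resolved in favour of the smaller subset.
    """
    keep = np.asarray(ranking)
    best_subset, best_err = keep, np.inf
    while True:
        rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=seed)
        rf.fit(X[:, keep], y)
        err = 1.0 - rf.oob_score_
        if err <= best_err:
            best_subset, best_err = keep, err
        if len(keep) <= min_vars:
            break
        # Keep the top-ranked (1 - drop_frac) share, never fewer than min_vars.
        n_next = max(min_vars, int(len(keep) * (1 - drop_frac)))
        keep = keep[:n_next]
    return best_subset, best_err


# Synthetic stand-in data: 300 samples, 60 variables, 6 of them informative.
X, y = make_classification(n_samples=300, n_features=60, n_informative=6,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

ranking = rank_by_importance(X, y)                       # step 1: rank
subset, oob_err = stepwise_random_forest(X, y, ranking)  # step 2: reduce

# Step 3 (simplified): logistic regression on the reduced subset.
logit = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(logit, X[:, subset], y, cv=5).mean()
print(len(subset), round(oob_err, 3), round(accuracy, 3))
```

Using the out-of-bag error as the selection criterion avoids a separate validation split, which matters when the number of variables far exceeds the number of samples; other criteria, such as cross-validated error, would fit the same loop.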
Source
《中华疾病控制杂志》
CAS
CSCD
Peking University Core Journal (北大核心)
2016, Issue 6, pp. 630-633 (4 pages)
Chinese Journal of Disease Control & Prevention
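Funding
National Natural Science Foundation of China (81530088, 81473070, 81373102, 61301251, 81402764)
Priority Academic Program Development of Jiangsu Higher Education Institutions (2014)
Natural Science Foundation of the Jiangsu Higher Education Institutions (12KJB310003)
Qing Lan Project of Jiangsu Province (2014)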