
The application of random forest for high dimensional DNA methylation data (高维DNA甲基化数据的随机森林降维分析)
Abstract
Objective: To apply the random forest algorithm to high-dimensional DNA methylation data from a case-control study of rheumatoid arthritis (RA) and to assess how well it performs.
Methods: The RA dataset was obtained from the Gene Expression Omnibus (GEO) data repository (accession number GSE42861) and contained 689 samples (354 patients and 335 controls). Chromosome 9 was selected because an RA-associated gene region is located on it, and a total of 2,433 cytosine-phosphate-guanine (CpG) sites were included. First, variable importance scores were calculated by random forest and the variables were ranked by these scores. Second, a stepwise random forest procedure was run on the ranked variables to find the subset most likely to be associated with the outcome. Third, stepwise Logistic regression was performed on the reduced subset of variables.
Results: Stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the Logistic regression model. The Logistic regression model containing these 13 CpGs had a prediction accuracy of 88.29%.
Conclusions: The random forest algorithm can greatly reduce the number of noise variables and improve statistical power, and it is suitable for the analysis of high-dimensional DNA methylation data.
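The ranking and stepwise search described in the Methods can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the search direction or stopping rule, so the forward search over nested subsets, the function names `rank_by_importance` and `stepwise_subset_search`, the CpG names, the importance scores, and the `toy_error` function (a stand-in for a random forest's out-of-bag error) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_by_importance(importance: Dict[str, float]) -> List[str]:
    """Return variable names sorted from most to least important."""
    return sorted(importance, key=lambda name: importance[name], reverse=True)


def stepwise_subset_search(
    ranked: Sequence[str],
    error_of: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Evaluate the nested subsets top-1, top-2, ... of the ranked variables
    and keep the one with the lowest error (ties favour the smaller subset)."""
    best_subset: List[str] = []
    best_error = float("inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        err = error_of(subset)
        if err < best_error:
            best_subset, best_error = subset, err
    return best_subset, best_error


# Hypothetical importance scores for four made-up CpG sites.
importance = {"cpg_a": 0.30, "cpg_b": 0.05, "cpg_c": 0.22, "cpg_d": 0.01}

# Assumption: only these two sites carry signal in the toy setting.
SIGNAL = {"cpg_a", "cpg_c"}


def toy_error(subset: Sequence[str]) -> float:
    """Stand-in for an out-of-bag error: signal sites lower it, noise sites raise it."""
    hits = len(SIGNAL & set(subset))
    noise = len(set(subset) - SIGNAL)
    return 0.5 - 0.2 * hits + 0.03 * noise


ranked = rank_by_importance(importance)
selected, err = stepwise_subset_search(ranked, toy_error)
print(ranked)    # variables in descending order of importance
print(selected)  # subset with the lowest toy error
```

In a real analysis, `error_of` would refit a random forest on each candidate subset and return its out-of-bag error, and the selected subset would then be passed to stepwise Logistic regression.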
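Source: Chinese Journal of Disease Control & Prevention (《中华疾病控制杂志》), 2016, Issue 6, pp. 630-633 (4 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (81530088, 81473070, 81373102, 61301251, 81402764); Priority Academic Program Development of Jiangsu Higher Education Institutions (2014); Natural Science Project of Jiangsu Higher Education Institutions (12KJB310003); Jiangsu Qing Lan Project (2014).
Keywords: Arthritis, rheumatoid; DNA methylation; Epidemiologic methods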