The Application of Rstudio and Random Jungle in High-dimensional Genome-wide Data Analysis (cited by: 2)

Abstract

Objective: To combine the strengths of Rstudio and Random Jungle so that genome-wide data can be analyzed simply and quickly via remote login.

Methods: Rstudio Server was set up on the server side. The Random Jungle program was wrapped in R and used to analyze real hypertension data from the Wellcome Trust Case Control Consortium (WTCCC), and the analysis results were generated dynamically with Sweave in Rstudio.

Results: Users on the client side could log in to the Rstudio Server through a web browser. With large data sets, the wrapped Random Jungle program running in Rstudio Server was clearly faster than the R randomForest package. The data comprised 2001 hypertension cases and 3004 controls, with 490032 SNPs across 22 chromosomes. Using the true labels, 10 Random Jungle runs were used to select the SNPs that ranked stably in the top 20; 10 of these also ranked in the top 11 by Cochran-Armitage test P value.

Conclusion: Processing genome-wide data consumes so much time and memory that an ordinary computer cannot cope, and Random Jungle's command-line mode of operation does not lend itself to data processing, to combining or nesting algorithms, or to reproducing and visualizing results. Setting up Rstudio Server on the server side and combining it with Random Jungle is an analysis strategy that can be applied effectively to genome-wide data: it simplifies the analysis, improves its speed and efficiency, and makes dynamic output and reproduction of the results convenient.
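The Cochran-Armitage trend test named in the abstract can be illustrated with a minimal sketch for a single SNP. This is not the authors' code (their analysis was run in R); the function name `cochran_armitage_trend`, the additive genotype weights (0, 1, 2), and the genotype counts in the example are assumptions made here for illustration, and Python is used only to keep the example self-contained.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for one SNP.

    cases / controls: genotype counts per category (e.g. AA, Aa, aa).
    weights: scores per category; (0, 1, 2) is the additive model (assumed).
    Returns (z, p): the normal-approximation statistic and two-sided P value.
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")

    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    column_totals = [r + s for r, s in zip(cases, controls)]

    # Weighted sums used by the statistic and its variance.
    a = sum(t * r for t, r in zip(weights, cases))
    b = sum(t * n for t, n in zip(weights, column_totals))
    q = sum(t * t * n for t, n in zip(weights, column_totals))

    # T = sum_i t_i * (r_i * S - s_i * R), simplified to N*A - R*B.
    t_stat = n_total * a - n_cases * b
    variance = n_cases * n_controls * (n_total * q - b * b) / n_total
    if variance <= 0:
        raise ValueError("variance is zero; the table carries no information")

    z = t_stat / math.sqrt(variance)
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    z, p = cochran_armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
    print(f"z = {z:.3f}, P = {p:.3g}")
```

In a genome-wide setting a function like this would be applied to each SNP in turn and the SNPs ranked by P value, which is the kind of ranking the abstract compares against the Random Jungle importance ranking.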

Authors: Gu Xingbo (顾星博), Li Ang (李昂), Wen Qi (温琪), Jiao Xinni (焦辛妮), Liu Yan (刘艳)

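Affiliations: Department of Health Statistics, Harbin Medical University; Department of Cardiology, First Affiliated Hospital of Harbin Medical University

Source: Chinese Journal of Health Statistics (《中国卫生统计》), CSCD, Peking University Core Journal; 2015, No. 6, pp. 955-958, 962 (5 pages)

Funding: National Natural Science Foundation of China (30972537, 81172741)

Keywords: genome-wide association study (GWAS); SNPs; Rstudio; Random Forest; Random Jungle

Classification: TP311.52 [Automation and Computer Technology: Computer Software and Theory]; R195.1 [Medicine and Health: Health Statistics]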


References (12)

1. Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics, 2010, 26(4): 445-455.
2. Rstudio. http://www.rstudio.com/products/rstudio/.
3. Rstudio Server. http://www.rstudio.com/products/rstudio/download-server/.
4. Xie Yihui (谢益辉). Sweave: building a reproducible statistical research workflow. http://cod.name/2010/11/reproducible-research-in-statistics/.
5. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics, 1955, 11(3): 375-386.
6. Breiman L. Random forests. Machine Learning, 2001, 45(1): 5-32.
7. Wu Xiaoyan (武晓岩), Li Kang (李康). Random forest method for discriminant analysis of gene expression data. Chinese Journal of Health Statistics, 2006, 23(6): 491-494.
8. Schwarz DF, König IR, Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics, 2010, 26(14): 1752-1758.
9. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007, 447(7145): 661-678.
10. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 2007, 81(3): 559-575.
