Abstract
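Objective To combine the strengths of RStudio and Random Jungle so that genome-wide data can be analyzed simply and quickly via remote login. Methods RStudio Server was set up on the server side, and a Random Jungle R wrapper program was used to analyze real hypertension data from the Wellcome Trust Case Control Consortium (WTCCC); the analysis results were generated dynamically with RStudio Sweave. Results Clients could log in to the RStudio Server through a web browser. With large data sets, the wrapped Random Jungle program running in RStudio Server was clearly faster than the R randomForest package. The data comprised 2001 hypertension cases and 3004 controls, with 490032 SNP loci across 22 chromosomes. Using the true labels, 10 Random Jungle runs selected the SNPs that ranked stably in the top 20; 10 of these also ranked in the top 11 by Cochran-Armitage test P value. Conclusion Processing genome-wide data consumes a great deal of time and memory, more than an ordinary computer can handle, and the command-line interface of Random Jungle does not lend itself to data processing, combining or nesting algorithms, or reproducing and visualizing results. Setting up RStudio Server on the server side and combining it with Random Jungle is an analysis strategy that can be applied effectively to genome-wide data: it simplifies the analysis process, improves analysis speed and efficiency, and makes it convenient to output and reproduce results dynamically.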
Objective To combine the advantages of RStudio and Random Jungle in order to analyze genome-wide data efficiently via remote login. Methods RStudio Server was set up on the server side, and a Random Jungle R wrapper program was used to analyze the Wellcome Trust Case Control Consortium (WTCCC) hypertension GWAS data and to generate the analysis results dynamically. Results Client-side users could log in to the RStudio Server through a web browser. When the amount of data was large, the wrapped Random Jungle program was clearly faster than the R randomForest package. The data comprised 2001 hypertension cases and 3004 controls, with a total of 490032 SNPs across 22 chromosomes. Of the SNPs that ranked stably in the top 20 across 10 Random Jungle runs on the true labels, 10 also ranked in the top 11 by Cochran-Armitage test P value. Conclusion Genome-wide data processing costs a great deal of time and memory, more than ordinary computers can handle. The command-line operation of Random Jungle is not well suited to data processing, combining or nesting algorithms, or reproducing and visualizing results. The strategy presented here, which combines RStudio Server with Random Jungle, can be applied effectively to GWAS data analysis. It simplifies the analysis process and improves the speed and efficiency of the analysis. It also makes dynamic output and reproduction of the results convenient.
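The abstract compares the Random Jungle ranking with Cochran-Armitage test P values but does not describe how that test was computed. As an illustration only, the following is a minimal Python sketch of the standard Cochran-Armitage trend test for one SNP, assuming additive genotype weights (0, 1, 2) and a two-sided normal approximation. The function name `cochran_armitage_trend` and the genotype counts are hypothetical and are not taken from the WTCCC data or from the authors' program.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k case-control table.

    cases[i] and controls[i] are the counts in genotype class i.
    weights are the genotype scores; (0, 1, 2) is an assumed
    additive coding, not something stated in the abstract.
    Returns (z, two_sided_p) under the normal approximation.
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")

    class_totals = [r + s for r, s in zip(cases, controls)]
    n = sum(class_totals)
    if n == 0:
        raise ValueError("the table is empty")
    n_cases = sum(cases)
    n_controls = n - n_cases

    sum_tr = sum(t * r for t, r in zip(weights, cases))
    sum_tn = sum(t * m for t, m in zip(weights, class_totals))
    sum_t2n = sum(t * t * m for t, m in zip(weights, class_totals))

    # Trend statistic: T = sum_i t_i * (r_i * S - s_i * R),
    # which simplifies to N * sum(t_i r_i) - R * sum(t_i n_i).
    statistic = n * sum_tr - n_cases * sum_tn

    # Variance of T under the null hypothesis of no association.
    variance = (n_cases * n_controls / n) * (n * sum_t2n - sum_tn ** 2)
    if variance <= 0:
        raise ValueError("variance is zero; the test is undefined for this table")

    z = statistic / math.sqrt(variance)
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical genotype counts (AA, Aa, aa) for a single SNP.
z, p = cochran_armitage_trend(cases=[10, 20, 40], controls=[40, 20, 10])
print(f"z = {z:.3f}, p = {p:.3g}")
```

In a genome-wide setting the same function would be applied to each SNP in turn and the SNPs ranked by P value, which is the kind of ranking the abstract compares against the Random Jungle importance ranking.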
Source
《中国卫生统计》
CSCD
Peking University Core Journals (北大核心)
2015, No. 6, pp. 955-958, 962 (5 pages)
Chinese Journal of Health Statistics
Funding
National Natural Science Foundation of China (30972537, 81172741)