Abstract
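Objective To investigate the factors influencing missing-value imputation of single nucleotide polymorphism (SNP) data and the resulting imputation performance, so as to provide a scientific basis for gene-disease association studies that use SNP data. Methods Using data from the International HapMap Project as the raw data, SNP genotype data were simulated from the raw data with HAPGEN2; missing data were generated artificially and then imputed, and the imputation error rate was analyzed under different conditions (four levels of missing proportion and four levels of reference sample size). Results The smaller the missing proportion and the larger the reference sample size, the lower the imputation error rate (mean error rates of 7.01%, 5.92%, 5.67%, and 5.26% for sample sizes of 50, 100, 150, and 200, respectively). Comparing the two missing patterns, when the missing proportion was large (r^2 = 0.825), imputation under random missingness (mean 5.64%) had a lower error rate than imputation under fixed-site missingness (mean 9.10%), whereas when the missing proportion was small (r^2 = 0.9), the fixed-site missing pattern had the lower error rate (mean 4.96%). Across all conditions, the imputation error rate of IMPUTE2 was 3% to 13%. Conclusion Missing proportion, reference sample size, and missing pattern all have some influence on imputation accuracy; imputing missing values in tag SNP data before further analysis is an effective strategy.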
Objective To study the performance of missing-data imputation for single nucleotide polymorphism (SNP) data and the factors that influence it, and to provide a scientific basis for the use of SNP data in gene-disease association studies. Methods Human genome data from the International HapMap Project were used as the raw data, and Haploview software was used for tag SNP screening. HAPGEN2 software was used to simulate the SNP reference data and the research data, in which missing values were then generated artificially. The research data were imputed with IMPUTE2 software based on the reference data, and the imputation error rates under different conditions (four levels each of missing proportion and reference sample size) were compared. Results The imputation error rate was positively associated with the proportion of missing data and inversely associated with the sample size of the reference data, with error rates of 7.01%, 5.92%, 5.67%, and 5.26% for reference sample sizes of 50, 100, 150, and 200, respectively. The error rate of random site imputation (5.64%) was lower than that of tag SNP imputation (9.10%) when the missing proportion was large (r^2 = 0.825), whereas tag SNP imputation gave the lower error rate (4.96%) when the missing proportion was small (r^2 = 0.9). Overall, IMPUTE2 produced low error rates (3% to 13%) across the conditions tested. Conclusion The proportion of missing data, the reference sample size, and the missing pattern all influence the imputation error rate. Imputing missing values in tag SNP data before further analysis is an effective strategy.
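The evaluation design described in the abstract (mask known genotypes, impute them, count mismatches) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the 0/1/2 genotype coding, and above all the imputer, which simply fills in the most common reference genotype at each site and is only a stand-in for IMPUTE2, not a reimplementation of it.

```python
import random

def mask_random(n_samples, n_sites, ratio, rng):
    """Random missingness: individual genotype cells are masked (hypothetical helper)."""
    cells = [(i, j) for i in range(n_samples) for j in range(n_sites)]
    return set(rng.sample(cells, round(ratio * len(cells))))

def mask_fixed(n_samples, n_sites, ratio, rng):
    """Fixed-site missingness: whole SNP columns are masked for every sample."""
    sites = rng.sample(range(n_sites), round(ratio * n_sites))
    return {(i, j) for i in range(n_samples) for j in sites}

def impute_mode(study, mask, reference):
    """Stand-in imputer (NOT IMPUTE2): use the most common reference genotype per site."""
    filled = [row[:] for row in study]
    for i, j in mask:
        column = [ref_row[j] for ref_row in reference]
        # sorted() makes tie-breaking deterministic (smallest genotype code wins)
        filled[i][j] = max(sorted(set(column)), key=column.count)
    return filled

def error_rate(truth, imputed, mask):
    """Share of masked genotypes whose imputed value differs from the true value."""
    if not mask:
        raise ValueError("mask is empty; nothing was imputed")
    wrong = sum(truth[i][j] != imputed[i][j] for i, j in mask)
    return wrong / len(mask)

if __name__ == "__main__":
    rng = random.Random(0)
    n_sites = 20
    # Toy genotypes coded 0/1/2; real data would come from a simulator such as HAPGEN2.
    reference = [[rng.choice([0, 0, 1, 2]) for _ in range(n_sites)] for _ in range(50)]
    truth = [[rng.choice([0, 0, 1, 2]) for _ in range(n_sites)] for _ in range(30)]
    for name, make_mask in (("random", mask_random), ("fixed-site", mask_fixed)):
        mask = make_mask(len(truth), n_sites, 0.2, rng)
        imputed = impute_mode(truth, mask, reference)
        print(name, error_rate(truth, imputed, mask))
```

The error rates printed by this toy run say nothing about the study's results; the sketch only shows how the two missing patterns differ and how an error rate is computed over the masked genotypes.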
Source
《中国公共卫生》
CAS
CSCD
PKU Core (Peking University Core Journals)
2014, Issue 12, pp. 1576-1582 (7 pages)
Chinese Journal of Public Health
Funding
National Natural Science Foundation of China (81172741, 30972537)