Abstract
Identifying and screening SNP loci in genes has become an increasingly important topic in association studies of complex diseases and genes. For a gene database of a certain disease, this paper first applies a locus classification scheme commonly used in medicine: the allele frequencies at each locus are counted over the whole sample to determine the major and minor alleles, and the base-pair (A, T, C, G) information at each locus is converted into a numerical code. Second, a chi-square test is used to coarsely screen candidate SNP loci. Finally, machine learning algorithms, including random forest, Bagging, AdaBoost, and Lasso logistic regression, are applied to select the loci on which the methods give consistent results, and cross-validation is used to verify the validity of the screening results.
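The first two steps of the abstract (numerical encoding by major/minor allele, then a coarse chi-square screen) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each genotype is a two-letter string such as "AG", that a locus is encoded as its count of non-major alleles (0, 1 or 2), and that labels are case/control indicators; the abstract specifies none of these details. The function names and the 0.05 threshold are hypothetical, and no continuity correction is applied.

```python
import math
from collections import Counter


def encode_locus(genotypes):
    """Encode genotype strings (e.g. 'AG') as the number of non-major alleles."""
    allele_counts = Counter("".join(genotypes))
    # The most frequent allele is taken as the major allele (assumption:
    # ties are broken by first appearance, which is what Counter does).
    major = allele_counts.most_common(1)[0][0]
    return [sum(1 for allele in g if allele != major) for g in genotypes]


def chi_square_test(codes, labels):
    """Pearson chi-square test of independence between codes and labels.

    Returns (statistic, p_value). Only 1 or 2 degrees of freedom are
    supported, which covers a case/control label and codes in {0, 1, 2}.
    """
    levels = sorted(set(codes))
    groups = sorted(set(labels))
    # Contingency table: one row per label group, one column per code level.
    table = [
        [sum(1 for c, y in zip(codes, labels) if c == lv and y == g)
         for lv in levels]
        for g in groups
    ]
    n = len(codes)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(groups) - 1) * (len(levels) - 1)
    if df == 0:
        return 0.0, 1.0  # no variation, so nothing to test
    if df == 1:
        p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, df = 1
    elif df == 2:
        p_value = math.exp(-chi2 / 2)  # chi-square tail, df = 2
    else:
        raise ValueError("only 1 or 2 degrees of freedom are supported")
    return chi2, p_value


def screen_loci(loci, labels, alpha=0.05):
    """Return the names of loci whose chi-square p-value is below alpha."""
    selected = []
    for name, genotypes in loci.items():
        _, p_value = chi_square_test(encode_locus(genotypes), labels)
        if p_value < alpha:
            selected.append(name)
    return selected
```

The loci that pass this coarse screen would then be handed to the classifiers named in the abstract, and only the loci that those methods agree on would be kept. For real data, a tested routine such as a statistics library's contingency-table test is preferable to this hand-written version.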
Authors
FANG Ya-lan (方雅兰)
KU Zai-qiang (库在强)
College of Mathematics and Statistics, Huanggang Normal University, Huanggang 438000, Hubei, China
Source
Journal of Huanggang Normal University (《黄冈师范学院学报》)
2019, No. 3, pp. 1-5 (5 pages)
Funding
2018 Huanggang Normal University Master of Education Teaching Case Project (JYJXAL2018001)