Abstract
To identify associations between human hereditary diseases and traits and loci on the genome, this paper proposes a fusion model for genome-wide association analysis of single nucleotide polymorphisms (SNPs) and disease. First, the 16-dimensional data are re-encoded to reduce their dimensionality. Next, an SNP association analysis model based on a two-stage ant colony algorithm is built, using the chi-squared statistic between a locus set and the class label as the evaluation function. Then, the loci most similar to the pathogenic loci are selected to form a new locus set, and a binary logistic regression model is fitted to analyze the association between the genetic disease and this new locus set. Finally, a random forest algorithm is used to verify the model's accuracy. Tests on the data show that the fusion model reaches a recognition rate of 85.8%, a marked improvement over the traditional method, and that it can effectively support multi-level association analysis across genetic diseases, genes, and loci.
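As a rough illustration of the evaluation function named in the abstract, the sketch below computes a Pearson chi-squared statistic between the genotypes at one locus and a binary class label. It is not the authors' implementation: the function name `chi_squared`, the 0/1/2 genotype coding, and the toy data are all assumptions, and the paper scores locus sets inside a two-stage ant colony search rather than single loci.

```python
from collections import Counter


def chi_squared(genotypes, labels):
    """Pearson chi-squared statistic for a genotype-by-label contingency table.

    genotypes: one genotype code per sample (assumed coding: 0, 1, 2).
    labels:    one class label per sample (assumed: 0 = control, 1 = case).
    """
    if len(genotypes) != len(labels):
        raise ValueError("genotypes and labels must have the same length")
    n = len(labels)
    observed = Counter(zip(genotypes, labels))  # cell counts
    row_totals = Counter(genotypes)             # count per genotype
    col_totals = Counter(labels)                # count per class label
    stat = 0.0
    for g, row in row_totals.items():
        for c, col in col_totals.items():
            expected = row * col / n            # count expected under independence
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): genotype 2 occurs only in cases.
associated = chi_squared([0, 0, 1, 1, 2, 2, 2, 2], [0, 0, 0, 0, 1, 1, 1, 1])
# Toy data (hypothetical): genotype carries no information about the label.
independent = chi_squared([0, 0, 1, 1], [0, 1, 0, 1])
```

A larger statistic indicates a stronger departure from independence between genotype and label, which is why it can serve as a fitness score when a search procedure compares candidate loci.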
Authors
张继荣
寇磊
ZHANG Jirong; KOU Lei (School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121)
Source
《计算机与数字工程》
2019, No. 9, pp. 2165-2169, 2175 (6 pages in total)
Computer & Digital Engineering
Keywords
genetic locus
two-stage ant colony algorithm
random forest
logistic regression analysis
chi-square test