Abstract
【Objective】This study aimed to use machine learning models built on CRISPR sequence information to predict the potential risk of diarrheagenic Escherichia coli infecting humans, and thereby to identify high-risk isolates with zoonotic potential.【Method】Genome sequences of 806 diarrheagenic Escherichia coli isolates collected in China were downloaded in bulk from the Enterobase database. Spacer sequences were extracted from the CRISPR loci and used to construct features. Machine learning models were then built, and their predictive performance was evaluated by 10-fold cross-validation. The best-performing model was used to output the infection risk of each isolate, and the potential risk to humans was compared among isolates from different animal sources.【Result】A total of 1093 spacer sequence clusters were obtained from the 806 isolates: 196 were unique to human isolates, 291 were unique to animal isolates, and 606 were shared by both. Linear discriminant analysis showed marked differences in the distribution of spacer sequence clusters between human and animal isolates. Using the spacer sequence clusters as features, four machine learning models were built: random forest, logistic regression, support vector machine, and gradient boosting decision tree. All four predicted the host with an accuracy above 0.82 and an area under the receiver operating characteristic curve (AUC) close to 0.9. The random forest model classified best; after optimization, its accuracy was 0.844 and its AUC was 0.915. According to the infection risk output by the best model, swine isolates carried the highest risk of infecting humans, ovine isolates carried a relatively low risk, and only a very small number of poultry isolates may have the potential to infect humans.【Conclusion】Machine learning models built on spacer sequences show some ability to identify diarrheagenic Escherichia coli isolates with zoonotic risk, and they offer a new approach to the prevention and control of infectious diseases.
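The feature construction described in the abstract can be illustrated with a minimal sketch. It assumes that each isolate has already been reduced to a set of spacer sequence cluster IDs, and it shows two things: how clusters split into human-only, animal-only, and shared groups, and how a binary presence/absence matrix could be laid out for a classifier. The isolate names, cluster IDs, and function names are hypothetical toy values, not data or code from the study, and the presence/absence encoding is an assumption about how "spacer sequence clusters as features" might be realized.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: isolate ID -> (host label, spacer cluster IDs).
# Real input would come from spacers extracted from CRISPR loci and
# grouped into clusters; every ID below is made up for illustration.
ISOLATES: Dict[str, Tuple[str, Set[str]]] = {
    "iso1": ("human", {"sc1", "sc2", "sc5"}),
    "iso2": ("human", {"sc2", "sc3"}),
    "iso3": ("animal", {"sc3", "sc4"}),
    "iso4": ("animal", {"sc4", "sc5", "sc6"}),
}


def partition_clusters(
    isolates: Dict[str, Tuple[str, Set[str]]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (human-only, animal-only, shared) spacer cluster sets."""
    human: Set[str] = set()
    animal: Set[str] = set()
    for host, clusters in isolates.values():
        if host == "human":
            human |= clusters
        else:
            animal |= clusters
    return human - animal, animal - human, human & animal


def presence_absence_matrix(
    isolates: Dict[str, Tuple[str, Set[str]]]
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Build one 0/1 row per isolate and a 0/1 host label (1 = human)."""
    columns = sorted(set().union(*(c for _, c in isolates.values())))
    rows = [
        [1 if col in clusters else 0 for col in columns]
        for _, clusters in isolates.values()
    ]
    labels = [1 if host == "human" else 0 for host, _ in isolates.values()]
    return columns, rows, labels


human_only, animal_only, shared = partition_clusters(ISOLATES)
columns, rows, labels = presence_absence_matrix(ISOLATES)
print(len(human_only), len(animal_only), len(shared))
```

The three groups are disjoint and together cover every cluster, which is the same bookkeeping behind the reported totals (196 + 291 + 606 = 1093). The `rows` and `labels` lists are in the shape that a classifier trained with cross-validation would typically take as input.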
Authors
FENG Xinyuan
ZHAO Jiaxue
LONG Jinzhao
HU Jingyan
XI Yanyan
CHEN Shuaiyin
YANG Haiyan
DUAN Guangcai
(College of Public Health, Zhengzhou University, Zhengzhou 450016, China)
Source
China Animal Husbandry & Veterinary Medicine (《中国畜牧兽医》)
CAS
CSCD
Peking University Core Journals (北大核心)
2024, No. 9, pp. 4060-4065 (6 pages)
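Funding
Henan Province Science and Technology Research Program (232102310018)
Natural Science Foundation of Henan Province (242300420374)
China Postdoctoral Science Foundation (2022M712859)
National Science and Technology Major Project (2018ZX10301407)
National Undergraduate Innovation and Entrepreneurship Program (202310459009)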
Keywords
spacer sequences
machine learning
diarrheagenic Escherichia coli
cross-host infection risk prediction