Abstract
Using metagenomic analysis to predict human disease and health status and to discover biomarkers is an active area of research. In this study, raw metagenomic data were processed with the bioinformatics tools KneadData and MetaPhlAn2 for quality control and removal of host contamination, yielding clean sequences. Dimensionality reduction methods and a random forest model were then used to select characteristic microbial taxa highly correlated with disease occurrence, and these taxa replaced the original data features as input to the disease prediction model. A fusion disease prediction model was constructed with a multilayer perceptron (MLP), a support vector machine (SVM), and extreme gradient boosting (XGBoost) as sub-models. After feature selection, cross-validation on three datasets (liver cirrhosis, type 2 diabetes, and obesity) gave AUC values of 0.9286, 0.6521, and 0.5747, respectively. The area under the ROC curve indicates that the model built on the selected taxa can screen for and diagnose disease efficiently and accurately and can effectively distinguish healthy individuals from patients. This work provides a useful reference for establishing a novel non-invasive, quantifiable auxiliary diagnostic method.
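The abstract does not say how the outputs of the three sub-models are fused or how the AUC was computed. As a minimal sketch only, the example below assumes simple soft voting (averaging the predicted disease probabilities) and computes the AUC as the fraction of patient/control pairs ranked correctly. The function names `soft_vote` and `roc_auc` and all probabilities are made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample disease probabilities across sub-models.

    Assumption: plain unweighted averaging; the paper's fusion rule may differ.
    """
    n_models = len(prob_sets)
    return [sum(col) / n_models for col in zip(*prob_sets)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random patient outscores a random control.

    A tie between a patient and a control counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probabilities for six samples (1 = patient, 0 = healthy).
labels = [1, 1, 1, 0, 0, 0]
mlp_probs = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]  # stand-in for MLP output
svm_probs = [0.8, 0.7, 0.3, 0.4, 0.3, 0.2]  # stand-in for SVM output
xgb_probs = [0.7, 0.5, 0.6, 0.3, 0.4, 0.1]  # stand-in for XGBoost output

fused = soft_vote([mlp_probs, svm_probs, xgb_probs])
print(f"MLP AUC   = {roc_auc(labels, mlp_probs):.4f}")
print(f"fused AUC = {roc_auc(labels, fused):.4f}")
```

On this reading, an AUC of 0.5 corresponds to ranking patients and controls at chance level and 1.0 to ranking every patient above every control, which gives a scale for the three reported values.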
Authors
曹海涛
朱静
曾海波
刘彦辰
CAO Haitao; ZHU Jing; ZENG Haibo; LIU Yanchen (Computer and Information Engineering College, Xinjiang Agricultural University, Urumqi 830052, China; Friendship Hospital of Urumqi, Urumqi 830049, China)
Source
《生物技术进展》
2023, Issue 5, pp. 798-806 (9 pages)
Current Biotechnology
Funding
National Natural Science Foundation of China (31860649).
Keywords
disease prediction
intestinal microbiota
feature screening
fusion model
metagenomics