Abstract
Objective: This study aimed to assess how well several machine learning models perform when integrating pathogenicity predictors for nonsynonymous variants, and to verify the contribution and performance of each prediction tool through feature importance analysis and validation on multiple datasets.
Methods: Twenty-seven pathogenicity prediction tools were used to score nonsynonymous variants in the ClinVar dataset and three external validation sets. Missing values were handled with three imputation methods: mean, median, and random forest imputation. Four classical machine learning models (random forest, neural network, naive Bayes, and extreme gradient boosting tree) were used to integrate the prediction tools; combined with the three imputation methods, this gave twelve models. The best imputation method was selected based on accuracy and kappa values on the internal validation set, and the four models using that imputation method were then compared on multiple performance metrics. The importance of each prediction tool in the ensemble model was evaluated with feature importance scores and verified on the internal and external validation sets.
Results: Random forest imputation performed best for handling missing values, with an average accuracy of 0.9080 and an average kappa value of 0.8087. Among the four machine learning algorithms, the extreme gradient boosting tree model showed the best overall performance across the metrics. The neural network and random forest models did not differ noticeably from the extreme gradient boosting tree model, while the naive Bayes model had the highest specificity and the shortest runtime but a lower kappa value. Feature importance scores indicated that AlphaMissense, VEST4, and MVP were the core features of the extreme gradient boosting tree model. In the internal validation set and all three external validation sets, the AUC values of AlphaMissense, VEST4, and DEOGEN2 ranked in the top five. The ensemble extreme gradient boosting tree model constructed in this study had an AUC of 0.9763 on the internal validation set, higher than that of any single prediction score, and AUC values above 0.96 on the external validation sets.
Conclusions: The extreme gradient boosting tree model with random forest imputation of missing values performed best in predicting the pathogenicity of nonsynonymous variants, and can be considered when integrating multiple prediction tools. Prediction tools such as AlphaMissense and VEST4 contributed substantially to the ensemble model and showed high reliability and accuracy, so they can provide reliable predictions of the pathogenicity of nonsynonymous variants.
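As an illustration of the metrics reported in the abstract (accuracy, kappa, and AUC), the following minimal Python sketch computes them for a binary pathogenic/benign classification. It is not the authors' code: the function names are hypothetical, the labels are assumed to be coded 1 (pathogenic) and 0 (benign), and the toy data are invented purely for demonstration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of variants whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement beyond what the label frequencies alone would give."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        # Degenerate case (a single class everywhere); returning 0.0 is a
        # choice made for this sketch, kappa is formally undefined here.
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def auc(y_true, scores):
    """AUC as the probability that a pathogenic variant outscores a benign one.

    Ties between a pathogenic and a benign score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 1 = pathogenic, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]

print("accuracy:", accuracy(y_true, y_pred))
print("kappa:   ", cohen_kappa(y_true, y_pred))
print("AUC:     ", auc(y_true, scores))
```

Kappa corrects accuracy for the agreement expected by chance, which is one reason a model can show high specificity yet a comparatively low kappa. The pairwise AUC computation above is quadratic in the number of variants and is meant only to show the definition; a rank-based implementation would be preferable for datasets of realistic size.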
Authors
沈茂婷
林俊维
范喜杰
陈涛
陈禹欣
蒙裕欢
于世辉
SHEN Maoting; LIN Junwei; FAN Xijie; CHEN Tao; CHEN Yuxin; MENG Yuhuan; YU Shihui (KingMed School of Laboratory Medicine, Guangzhou Medical University, Guangzhou 511436, Guangdong, China; Guangzhou KingMed Transformative Medicine Institute Co., Ltd., Guangzhou 510320, Guangdong, China; Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou 510623, Guangdong, China; Guangzhou KingMed Diagnostics Group Co., Ltd., Guangzhou 510320, Guangdong, China)
Source
《广州医科大学学报》
2024, Issue 5, pp. 1-9 (9 pages)
Academic Journal of Guangzhou Medical University
Funding
Guangzhou Science and Technology Plan Project (2023A03J0540).
Keywords
pathogenicity prediction
machine learning
nonsynonymous variant
computational predictors
integrated feature analysis