Prediction of nonsynonymous variant pathogenicity and feature importance analysis based on machine learning

Abstract

Objective: This study aimed to evaluate how well several machine learning models perform when integrating pathogenicity prediction tools for nonsynonymous variants, and to verify the contribution and performance of each prediction tool through feature importance analysis and validation on multiple datasets.

Methods: Twenty-seven pathogenicity prediction tools were used to score nonsynonymous variants in the ClinVar dataset and in three external validation sets. Missing values were handled with three methods: mean, median, and random forest imputation. Four classical machine learning models (random forest, neural network, naive Bayes, and extreme gradient boosting tree) were used to integrate the prediction tools; combined with the three imputation methods, this yielded 12 models. The best imputation method was selected according to accuracy and kappa on the internal validation set, and the four models using that method were then compared on multiple performance metrics. The importance of each prediction tool in the ensemble model was assessed with feature importance scores and verified on the internal and external validation sets.

Results: Random forest imputation performed best in handling missing values, with a mean accuracy of 0.9080 and a mean kappa of 0.8087. Among the four machine learning algorithms, the extreme gradient boosting tree model showed the best overall performance across the metrics. The neural network and random forest models did not differ noticeably from the extreme gradient boosting tree model, while the naive Bayes model had the highest specificity and the shortest runtime but a lower kappa. Feature importance scores indicated that AlphaMissense, VEST4, and MVP were the core features of the extreme gradient boosting tree model. In the internal validation set and in all three external validation sets, the AUC values of AlphaMissense, VEST4, and DEOGEN2 ranked in the top five. The ensemble extreme gradient boosting tree model built in this study reached an AUC of 0.9763 on the internal validation set, higher than any single prediction score, and its AUC exceeded 0.96 on every external validation set.

Conclusions: The extreme gradient boosting tree model with random forest imputation of missing values performed best in predicting the pathogenicity of nonsynonymous variants, and it can be considered when integrating multiple prediction tools. Prediction tools such as AlphaMissense and VEST4 contributed substantially to the ensemble model and showed high reliability and accuracy, so they can provide reliable pathogenicity predictions for nonsynonymous variants.
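The evaluation quantities named in the abstract (accuracy, kappa, and AUC, computed after missing predictor scores are filled in) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy labels and scores are made up, the 0.5 decision threshold is an assumption, only mean and median imputation are shown, and the random forest imputation and the four classifiers are left out to keep the example dependency-free.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Fill None entries of one predictor's score column with the column mean or median."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def accuracy(y_true, y_pred):
    """Fraction of variants whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in (0, 1))
    return (p_o - p_e) / (1 - p_e)


def auc(y_true, scores):
    """AUC as the probability that a pathogenic variant outscores a benign one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = pathogenic, 0 = benign; None marks a missing score.
labels = [1, 1, 1, 0, 0, 0]
raw_scores = [0.9, None, 0.8, 0.1, 0.4, None]

scores = impute(raw_scores, "mean")
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print("accuracy:", accuracy(labels, predicted))
print("kappa:   ", cohen_kappa(labels, predicted))
print("AUC:     ", auc(labels, scores))
```

Kappa corrects accuracy for agreement expected by chance, which is why a model can combine high specificity with a comparatively low kappa, as the abstract reports for naive Bayes.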

Authors: SHEN Maoting (沈茂婷); LIN Junwei (林俊维); FAN Xijie (范喜杰); CHEN Tao (陈涛); CHEN Yuxin (陈禹欣); MENG Yuhuan (蒙裕欢); YU Shihui (于世辉)

Affiliations: KingMed School of Laboratory Medicine, Guangzhou Medical University, Guangzhou 511436, Guangdong, China; Guangzhou KingMed Transformative Medicine Institute Co., Ltd., Guangzhou 510320, Guangdong, China; Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou 510623, Guangdong, China; Guangzhou KingMed Diagnostics Group Co., Ltd., Guangzhou 510320, Guangdong, China

Source: Academic Journal of Guangzhou Medical University (《广州医科大学学报》), 2024, No. 5, pp. 1-9 (9 pages)

Funding: Guangzhou Science and Technology Program project (2023A03J0540).

Keywords: pathogenicity prediction; machine learning; nonsynonymous variant; computational predictors; integrated feature analysis

Classification: Q811.4 [Biology: Bioengineering]; TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

