Next-generation sequencing technologies both boost the discovery of variants in the human genome and exacerbate the challenges of pathogenic variant identification. In this study, we developed the Pathogenicity Prediction Tool for missense variants (mvPPT), a highly sensitive and accurate missense variant classifier based on gradient boosting. mvPPT adopts high-confidence training sets with a wide spectrum of variant profiles and extracts three categories of features: scores from existing prediction tools, frequencies (allele frequencies, amino acid frequencies, and genotype frequencies), and genomic context. Compared with established predictors, mvPPT achieves superior performance in all test sets, regardless of data source. In addition, our study provides guidance for training set and feature selection strategies and reveals highly relevant features, which may provide further biological insights into variant pathogenicity.
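To make the idea of a gradient-boosted classifier over mixed feature categories concrete, the following is a minimal sketch in pure Python. It is not the mvPPT implementation: the three feature columns (`tool_score`, `allele_freq`, `context`), the synthetic labels, the use of one-split decision stumps, and all function names are assumptions chosen only for illustration. Each round fits a stump to the residuals of the logistic loss and adds it to the running score.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, scores):
    """Mean logistic loss for labels y (0/1) and raw (log-odds) scores."""
    eps = 1e-12
    total = 0.0
    for yi, s in zip(y, scores):
        p = min(max(sigmoid(s), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

def fit_stump(X, residuals):
    """Pick the (feature, threshold) split that minimises squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for t in values[:-1]:  # splitting at the largest value would leave the right leaf empty
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left leaf value, right leaf value)

def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on the negative gradient of the logistic loss (y - p)."""
    pos = sum(y) / len(y)
    f0 = math.log(pos / (1 - pos))  # start from the log-odds of the positive class
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return f0, learning_rate, stumps

def raw_score(model, row):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, row) for s in stumps)

def predict_proba(model, row):
    """Predicted probability of the positive (here: 'pathogenic') class."""
    return sigmoid(raw_score(model, row))

# Synthetic toy data (hypothetical, not from the study): one column per feature category.
rng = random.Random(0)
X, y = [], []
for _ in range(120):
    tool_score = rng.random()          # stand-in for a score from an existing predictor
    allele_freq = rng.random() * 0.02  # stand-in for a population allele frequency
    context = rng.random()             # stand-in for a genomic-context feature
    X.append([tool_score, allele_freq, context])
    y.append(1 if tool_score > 0.5 and allele_freq < 0.01 else 0)

model = fit_gradient_boosting(X, y)
```

Calling `predict_proba(model, row)` on a feature row returns a value between 0 and 1. A production classifier would use deeper trees, regularisation, and held-out test sets rather than this toy setup.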
Funding: Supported by the National Key R&D Program of China (Grant No. 2021ZD0202500); the Shanghai Natural Science Foundation, China (Grant No. 20ZR1403800); the National Natural Science Foundation of China (Grant Nos. 31900476, 82071259, 31930044, and 31725012); the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01); ZJ Lab; the Shanghai Center for Brain Science and Brain-Inspired Technology, China; the Foundation of Shanghai Municipal Education Commission, China (Grant No. 2019-01-07-00-07-E00062); and the Collaborative Innovation Program of Shanghai Municipal Health Commission, China (Grant No. 2020CXJQ01).