Funding: Supported by the Zhejiang Provincial Natural Science Foundation (LY20H030006), the Key Research & Development Program of Zhejiang (2023C03045), the Fundamental Research Funds for the Central Universities (2022ZFJH003), the Jinan Microecological Biomedicine Shandong Laboratory (JNL-2022036C), and the Public Welfare Project of Jinhua City, Zhejiang (2021-4-359).
Abstract: Bacterial genome sequencing is a powerful technique for studying the genetic diversity and evolution of microbial populations. However, detecting genomic variants from sequencing data is challenging because of contamination, sequencing errors, and the presence of multiple strains within the same species. Several bioinformatics tools have been developed to address these issues, but their performance and accuracy have not been systematically evaluated. In this study, we compared 10 variant detection pipelines using 18 simulated and 17 real datasets of high-throughput sequences from a set of representative bacteria. We assessed the sensitivity of each pipeline under different conditions of coverage, simulation, and strain diversity. We also demonstrated the application of these tools by identifying consistent mutations in a dataset of 30 repeated sequencing runs of Staphylococcus hominis. We found that HaplotypeCaller, but not Mutect2, from the GATK tool set showed the best performance in terms of accuracy and robustness. CFSAN and Snippy performed less well on several simulated and real sequencing datasets. Our results provide a comprehensive benchmark and guidance for choosing the optimal variant detection pipeline for high-throughput bacterial genome sequencing data.