Funding: Superior Farms sheep producers; IBEST for their support; financial support from the Idaho Global Entrepreneurial Mission.
Abstract: Background: Pan-genomics is a recently emerging strategy that can provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine identified variants across multiple related samples. However, the improvement in variant identification gained from mutual support information across multiple samples remains quite limited for population-scale genotyping. Results: In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep by incorporating sequencing error and optimizing the mutual support information from multiple samples' data. Variants were identified from multiple samples in four steps: (1) probabilities of variants from two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model that incorporates the potential for base sequencing error; (2) variants with high mapping quality, or consistently identified in at least two samples by both GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid eliminating potentially true variants from the rHID database, the variants that failed FDR control were reexamined to rescue potentially true variants and ensure highly accurate variant identification. The results indicated that the percentage of concordant SNPs and indels between Freebayes and GATK improved significantly, by 12%-32%, after applying the new method compared with the raw variants. The method also detected low-frequency variants in individual sheep in genes associated with several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154). Conclusion: The new method uses a computational strategy to reduce the number of false positives while simultaneously improving the identification of genetic variants. The strategy incurs no extra cost, as it requires no additional samples or sequencing data, and it identifies rare variants that can be important for practical applications in animal breeding.
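The four steps in the abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`poisson_tail`, `build_rhid`, `fdr_filter`), the input layout, the use of a Poisson upper tail as the "probability value", and the way FDR is estimated (the fraction of ranked calls absent from the rHID set) are all assumptions made for the example. The mapping-quality criterion of step 2 is omitted, and the rescue criteria of step 4 are not given in the abstract, so the sketch only returns the variants that failed FDR control for later reexamination.

```python
import math
from collections import Counter

def poisson_tail(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Poisson(depth * error_rate).

    Assumed reading of step 1: the chance that sequencing error alone
    would produce at least this many alternate-allele reads.
    """
    lam = depth * error_rate
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i)
              for i in range(alt_reads))
    return max(0.0, 1.0 - cdf)

def build_rhid(calls_by_sample, min_samples=2):
    """Step 2 (simplified): keep variants called by both GATK and
    Freebayes in at least `min_samples` samples.

    calls_by_sample: {sample: {"gatk": set, "freebayes": set}}
    (hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for callers in calls_by_sample.values():
        for variant in callers["gatk"] & callers["freebayes"]:
            counts[variant] += 1
    return {v for v, n in counts.items() if n >= min_samples}

def fdr_filter(scored, rhid, max_fdr=0.05):
    """Step 3 (assumed form): rank one sample's variants by probability
    value and keep the longest prefix whose estimated FDR, taken here as
    the fraction of calls not found in rHID, stays within `max_fdr`.

    Returns (kept, failed); `failed` is what step 4 would reexamine.
    """
    ranked = sorted(scored, key=lambda item: item[1])
    cutoff = 0
    misses = 0
    for i, (variant, _) in enumerate(ranked, start=1):
        if variant not in rhid:
            misses += 1
        if misses / i <= max_fdr:
            cutoff = i
    kept = [v for v, _ in ranked[:cutoff]]
    failed = [v for v, _ in ranked[cutoff:]]
    return kept, failed
```

In this sketch a smaller tail probability means the alternate reads are less likely to be explained by sequencing error, so variants are ranked from smallest to largest value before the cutoff is chosen.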
Funding: Supported by the Key-Areas Research and Development Program of Guangdong Province (2020B020220004); the Youth Innovation Promotion Association, Chinese Academy of Sciences (2017399); the Science and Technology Program of Guangzhou (202002030097); the Hong Kong Research Grants Council Area of Excellence Scheme (AoE/M-403/16); the ECS (27204518); and the TRS of the HKSAR government (T21-705/20-N).
Abstract: The Human Genome Project opened an era of (epi)genomic research and provided a platform for the development of new sequencing technologies. During and after the project, several sequencing technologies have continued to dominate the nucleic acid sequencing market. Currently, Illumina (short-read), PacBio (long-read), and Oxford Nanopore (long-read) are the most popular sequencing technologies. Unlike PacBio and the popular short-read sequencers before it, which, as examples of second-generation or so-called Next-Generation Sequencing platforms, need to synthesize while sequencing, nanopore technology directly sequences native DNA and RNA molecules. Nanopore sequencing therefore avoids converting mRNA into cDNA molecules, which not only allows extremely long native DNA and full-length RNA molecules to be sequenced but also documents the modifications that have been made to those native DNA or RNA bases. In this review of direct DNA sequencing and direct RNA sequencing using Oxford Nanopore technology, we focus on their development and application achievements and discuss their challenges and future perspectives. We also address the problems researchers may encounter when applying these approaches to their research topics, and how to resolve them.