Funding: The research was supported by the U.S. National Institutes of Health (R01GM120624), the National Science Foundation (DMS-1518001), the National Natural Science Foundation of China (11701546), and the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems (CBIOMES, grant ID 549943). We thank Drs. Michael S. Waterman, Gesine Reinert, Ying Wang, Rui Jiang, Yang Lu, Lizzie Dorfman, Mr. Weili Wang, and Mr. Luigi Manna for helpful discussions and suggestions. We thank the USC Center for High Performance Computing (HPC) for helping us use their cluster computers.
Abstract: Background: The recent development of metagenomic sequencing makes it possible to sequence microbial genomes, including viral genomes, at massive scale without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient at identifying unknown viruses or short viral sequences in metagenomic data. Methods: We developed DeepVirFinder, a reference-free and alignment-free machine learning method that uses deep learning to identify viral sequences in metagenomic data. Results: Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved accuracy for under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were associated with cancer status, suggesting that viruses may play important roles in CRC. Conclusions: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
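The AUROC values quoted in the abstract above summarize how well a classifier's scores rank viral contigs ahead of non-viral ones. As a minimal sketch of the metric only (not of DeepVirFinder itself), the function below computes AUROC as the fraction of positive/negative pairs that are ranked correctly, counting ties as half. The function name `auroc` and the toy scores and labels are assumptions made for illustration; they are not data from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random negative.

    scores: classifier outputs, higher means "more likely viral" (hypothetical).
    labels: 1 for viral, 0 for non-viral.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores: three viral and three non-viral contigs.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)
```

The pairwise loop is quadratic in the number of contigs, which is fine for illustration; a rank-based formulation would be preferable for large evaluation sets.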
Abstract: Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases from metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approach for predicting disease status from metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status from the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that RF prediction performance using the relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
Funding: Supported by NSFC grants (Nos. 11571349 and 91630314), the National Key R&D Program of China under Grant 2018YFB0704304, NCMIS of CAS, LSC of CAS, and the Youth Innovation Promotion Association of CAS. JR and FS were supported by the US National Science Foundation (NSF) (DMS-1518001) and the National Institutes of Health (NIH) (R01GM120624, 1R01GM131407).
Abstract: Background: Markov chains (MC) have been widely used to model molecular sequences. The estimation of the MC transition matrix and of confidence intervals for the transition probabilities from long sequence data has been intensively studied in the past decades. In next generation sequencing (NGS), a large number of short reads are generated. These short reads can overlap, and some regions of the genome may not be sequenced, resulting in a new type of data. Based on NGS data, the transition probabilities of an MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid. Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximately follow the same distributions as the corresponding statistics for long sequences. Results: We apply the asymptotic properties of these statistics to find theoretical confidence regions for MC transition probabilities based on NGS short-read data. We validate our theoretical confidence intervals using both simulated and real data sets, and compare the results with those obtained by the parametric bootstrap method. Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
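The abstract above mentions estimating MC transition probabilities from short reads by moment estimators without spelling out the estimator. One simple reading, offered here as an assumption rather than the authors' exact procedure, is to pool dinucleotide counts over all reads and normalize each row. The function name `estimate_transitions` and the toy reads are hypothetical. The sketch produces point estimates only; it does not apply the effective-coverage scaling or build the confidence regions the paper describes.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_transitions(reads):
    """Estimate a first-order transition matrix from pooled dinucleotide counts.

    Assumption: p(b | a) is estimated as N_ab / sum_c N_ac, where N_ab counts
    occurrences of the dinucleotide "ab" across all reads. Overlapping reads
    are counted as-is, so the counts are not independent observations.
    """
    pair_counts = Counter()
    for read in reads:
        for a, b in zip(read, read[1:]):
            # Skip ambiguous bases such as N.
            if a in ALPHABET and b in ALPHABET:
                pair_counts[a, b] += 1

    matrix = {}
    for a in ALPHABET:
        row_total = sum(pair_counts[a, b] for b in ALPHABET)
        matrix[a] = {
            b: (pair_counts[a, b] / row_total if row_total else 0.0)
            for b in ALPHABET
        }
    return matrix


# Toy reads (made up) that overlap, as NGS reads typically do.
reads = ["AACG", "ACGA", "CGAA"]
P = estimate_transitions(reads)
```

Because overlapping reads reuse the same genomic positions, treating these counts as if they came from one long sequence would misstate the uncertainty of the estimates, which is the gap the abstract's asymptotic results address.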
Abstract: The International Workshop on Applications of Probability and Statistics to Biology (APSB) was successfully held in Shanghai, China, July 11-13, 2019. The workshop was hosted by the Institute of Science and Technology for Brain-inspired Intelligence (ISTBI) at Fudan University and was held in honor of the 80th birthday of Prof. Minping Qian of Peking University. Most of the twenty-eight speakers were former students or close collaborators of Prof. Qian, and there were over eighty participants from all over China and the United States.
Abstract: RECOMB 2013 was successfully held at Tsinghua University, Beijing, China, on April 7-10, 2013, hosted by the Bioinformatics Division and Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology (TNLIST). About 500 professionals from academia and industry, representing 29 countries and regions, attended the conference and the RECOMB-Seq satellite workshop that followed it.