Genomic data serve as an invaluable resource for unraveling the intricacies of higher plant systems, including the constituent elements within and among species. Through various efforts in genomic data archiving, integrative analysis, and value-added curation, the National Genomics Data Center (NGDC), which is part of the China National Center for Bioinformation (CNCB), has established and currently maintains a large collection of database resources. This dedicated initiative of the NGDC facilitates a data-rich ecosystem that greatly strengthens and supports genomic research efforts. Here, we present a comprehensive overview of central repositories dedicated to archiving, presenting, and sharing plant omics data, introduce knowledgebases focused on variants or gene-based functional insights, highlight species-specific multiple omics database resources, and briefly review the online application tools. We intend this review to serve as a guide map for plant researchers wishing to select effective data resources from the NGDC for their specific areas of study.
Biological databases serve as a global fundamental infrastructure for the worldwide scientific community, dramatically aiding the transformation of big data into knowledge discovery and driving significant innovations in a wide range of research fields. Given the rapid data production, biological databases continue to increase in size and importance. To build a catalog of worldwide biological databases, we curate a total of 5825 biological databases from 8931 publications, which are geographically distributed in 72 countries/regions and developed by 1975 institutions (as of September 20, 2022). We further devise a z-index, a novel index to characterize the scientific impact of a database, and rank all these biological databases as well as their hosting institutions and countries in terms of citation and z-index. Consequently, we present a series of statistics and trends of worldwide biological databases, yielding a global perspective to better understand their status and impact for life and health sciences. An up-to-date catalog of worldwide biological databases, as well as their curated meta-information and derived statistics, is publicly available at Database Commons (https://ngdc.cncb.ac.cn/databasecommons/).
Genome data of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are essential for virus diagnosis, vaccine development, and variant surveillance. To archive and integrate worldwide SARS-CoV-2 genome data, a series of resources have been constructed, serving as a fundamental infrastructure for SARS-CoV-2 research, pandemic prevention and control, and coronavirus disease 2019 (COVID-19) therapy. Here we present an overview of extant SARS-CoV-2 resources that are devoted to genome data deposition and integration. We review deposition resources in terms of data accessibility, metadata standardization, and data curation and annotation, and review integrative resources in terms of data source, de-redundancy processing, data curation and quality assessment, and variant annotation. Moreover, we address issues that impede SARS-CoV-2 genome data integration, including low-complexity, inconsistency and absence of isolate name, sequence inconsistency, asynchronous update of genome data, and mismatched metadata. We finally provide insights into data standardization consensus and data submission guidelines, to promote SARS-CoV-2 genome data sharing and integration.
The Resource for Coronavirus 2019 (RCoV19) is an open-access information resource dedicated to providing valuable data on the genomes, mutations, and variants of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In this updated implementation of RCoV19, we have made significant improvements and advancements over the previous version. Firstly, we have implemented a highly refined genome data curation model. This model now features an automated integration pipeline and optimized curation rules, enabling efficient daily updates of data in RCoV19. Secondly, we have developed a global and regional lineage evolution monitoring platform, alongside an outbreak risk pre-warning system. These additions provide a comprehensive understanding of SARS-CoV-2 evolution and transmission patterns, enabling better preparedness and response strategies. Thirdly, we have developed a powerful interactive mutation spectrum comparison module. This module allows users to compare and analyze mutation patterns, assisting in the detection of potential new lineages. Furthermore, we have incorporated a comprehensive knowledgebase on mutation effects. This knowledgebase serves as a valuable resource for retrieving information on the functional implications of specific mutations. In summary, RCoV19 serves as a vital scientific resource, providing access to valuable data, relevant information, and technical support in the global fight against COVID-19. The complete contents of RCoV19 are available to the public at https://ngdc.cncb.ac.cn/ncov/.
Background: Big data challenges. In the late 1980s and early 1990s, three major international biological data centers were created: the DNA Database of Japan (DDBJ) [1], the European Bioinformatics Institute (EMBL-EBI) in the United Kingdom (UK) [2], and the National Center for Biotechnology Information (NCBI) in the United States (US) [3].
The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.
The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata of biological project, biological sample, and genome assembly, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By May 21, 2021, GWH had received 19,124 direct submissions covering a diversity of 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
On January 22, 2020, the China National Center for Bioinformation (CNCB) released the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access information resource for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality evaluated by an automated in-house pipeline. Of particular note, 2019nCoVR offers systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and their detailed statistics for each virus isolate, and congregates the quality score, functional annotation, and population frequency for each variant. Spatiotemporal change for each variant can be visualized, and historical viral haplotype network maps for the course of the outbreak are also generated based on all complete and high-quality genomes available. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2 relevant literature on the coronavirus disease 2019 (COVID-19), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with NCBI. Collectively, 2019nCoVR is updated daily to collect the latest information on genome sequences, variants, haplotypes, and literature for a timely reflection, making it a valuable resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.
A novel RNA virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is responsible for the ongoing outbreak of coronavirus disease 2019 (COVID-19). Population genetic analysis could be useful for investigating the origin and evolutionary dynamics of COVID-19. However, due to extensive sampling bias and the existence of infection clusters during the epidemic spread, direct applications of existing approaches can lead to biased parameter estimations and data misinterpretation. In this study, we first present a robust estimator for the time to the most recent common ancestor (TMRCA) and the mutation rate, and then apply the approach to analyze 12,909 genomic sequences of SARS-CoV-2. The mutation rate is inferred to be 8.69 × 10^(−4) per site per year with a 95% confidence interval (CI) of [8.61 × 10^(−4), 8.77 × 10^(−4)], and the TMRCA of the samples is inferred to be Nov 28, 2019 with a 95% CI of [Oct 20, 2019, Dec 9, 2019]. The results indicate that COVID-19 might originate earlier than and outside of Wuhan Seafood Market. We further demonstrate that genetic polymorphism patterns, including the enrichment of specific haplotypes and the temporal allele frequency trajectories generated from infection clusters, are similar to those caused by evolutionary forces such as natural selection. Our results show that population genetic methods need to be developed to efficiently detangle the effects of sampling bias and infection clusters to gain insights into the evolutionary mechanism of SARS-CoV-2. Software for implementing VirusMuT can be downloaded at https://bigd.big.ac.cn/biocode/tools/BT007081.
Data and their tailored characteristics are inheritable and long-lived, surpassing their analyzed results and conclusions regardless of whether they are produced by their generators or users. Aside from designing experiments for new acquisition, scientific researchers always begin with a thorough synthesis of the existing data, especially those that have been demonstrated to be authentic and timely.
A new variant of concern for SARS-CoV-2, Omicron (B.1.1.529), was designated by the World Health Organization on November 26, 2021. This study analyzed the viral genome sequencing data of 108 samples collected from patients infected with Omicron. First, we found that the enrichment efficiency of viral nucleic acids was reduced due to mutations in the regions where the primers anneal. Second, the Omicron variant possesses an excessive number of mutations compared to other variants circulating at the same time (median: 62 vs. 45), especially in the Spike gene. Mutations in the Spike gene confer alterations in 32 amino acid residues, more than those observed in other SARS-CoV-2 variants. Moreover, a large number of nonsynonymous mutations occur in the codons for the amino acid residues located on the surface of the Spike protein, which could potentially affect the replication, infectivity, and antigenicity of SARS-CoV-2. Third, there are 53 mutations between the Omicron variant and its closest sequences available in public databases. Many of these mutations were rarely observed in public databases and had a low mutation rate. In addition, the linkage disequilibrium between these mutations was low, with a limited number of mutations concurrently observed in the same genome, suggesting that the Omicron variant would be in a different evolutionary branch from the currently prevalent variants. To improve our ability to detect and track the source of new variants rapidly, it is imperative to further strengthen genomic surveillance and data sharing globally in a timely manner.
COVID-19 has swept the globe, and Pakistan is no exception. To investigate the initial introductions and transmissions of SARS-CoV-2 in Pakistan, we performed the largest genomic epidemiology study of COVID-19 in Pakistan and generated 150 complete SARS-CoV-2 genome sequences from samples collected from March 16 to June 1, 2020. We identified a total of 347 mutated positions, 31 of which were over-represented in Pakistan. Meanwhile, we found over 1000 intra-host single-nucleotide variants (iSNVs). Several of them occurred concurrently, indicating possible interactions among them or coevolution. Some of the high-frequency iSNVs in Pakistan were not observed in the global population, suggesting strong purifying selection. The genomic epidemiology revealed five distinctive spreading clusters. The largest cluster consisted of 74 viruses which were derived from different geographic locations of Pakistan and formed a deep hierarchical structure, indicating an extensive and persistent nation-wide transmission of the virus that was probably attributed to a signature mutation (G8371T in ORF1ab) of this cluster. Furthermore, 28 putative international introductions were identified, several of which are consistent with the epidemiological investigations. In all, this study has inferred the possible pathways of introductions and transmissions of SARS-CoV-2 in Pakistan, which could aid ongoing and future viral surveillance and COVID-19 control.
Monkeypox is a viral zoonotic disease endemic in Central and West Africa. Since January 1, 2022, 3413 laboratory-confirmed monkeypox cases and one death have been reported from 50 countries/territories in five WHO regions (as of June 22, 2022; https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON396), and 1310 new cases and eight new countries have been reported in the past week. Genomic epidemiology is vital to determine the similarity between viruses and suggest possible links between cases, origins of infection, and transmission dynamics when combined with epidemiological information. However, one of the priority evidence gaps relating to the monkeypox outbreak is genome sequencing and in-host variation analysis [1]. Timely sharing of both raw sequence data and consensus genomic data is therefore useful to public health investigators and academic partners undertaking related studies.
SARS-CoV-2 is a new RNA virus affecting humans and has spread extensively throughout the world since its first outbreak in December 2019. Whether the transmissibility and pathogenicity of SARS-CoV-2 in humans after zoonotic transfer are actively evolving, and driven by adaptation to the new host and environments, is still under debate. Understanding the evolutionary mechanism underlying epidemiological and pathological characteristics of COVID-19 is essential for predicting the epidemic trend, and providing guidance for disease control and treatments. Interrogating novel strategies for identifying natural selection using within-species polymorphisms and 3,674,076 SARS-CoV-2 genome sequences of 169 countries as of December 30, 2021, we demonstrate with population genetic evidence that during the course of the SARS-CoV-2 pandemic in humans, 1) SARS-CoV-2 genomes are overall conserved under purifying selection, especially for the 14 genes related to viral RNA replication, transcription, and assembly; 2) ongoing positive selection is actively driving the evolution of 6 genes (e.g., S, ORF3a, and N) that play critical roles in molecular processes involving pathogen–host interactions, including viral invasion into and egress from host cells, and viral inhibition and evasion of host immune response, possibly leading to high transmissibility and mild symptoms in SARS-CoV-2 evolution. According to an established haplotype phylogenetic relationship of 138 viral clusters, a spatial and temporal landscape of 556 critical mutations is constructed based on their divergence among viral haplotype clusters or repeated increase in frequency within at least 2 clusters, of which multiple mutations potentially conferring alterations in viral transmissibility, pathogenicity, and virulence of SARS-CoV-2 are highlighted, warranting attention.
The rapid advancement of sequencing technologies poses challenges in managing the large volume and exponential growth of sequence data efficiently and on time. To address this issue, we present GenBase (https://ngdc.cncb.ac.cn/genbase), an open-access data repository that follows the International Nucleotide Sequence Database Collaboration (INSDC) data standards and structures, for efficient nucleotide sequence archiving, searching, and sharing. As a core resource within the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GenBase offers a bilingual submission pipeline and services, as well as local submission assistance in China. GenBase also provides a unique Excel format for metadata description and feature annotation of nucleotide sequences, along with a real-time data validation system to streamline sequence submissions. As of April 23, 2024, GenBase had received 68,251 nucleotide sequences and 689,574 annotated protein sequences across 414 species from 2319 submissions. Out of these, 63,614 (93%) nucleotide sequences and 620,640 (90%) annotated protein sequences have been released and are publicly accessible through GenBase's web search system, File Transfer Protocol (FTP), and Application Programming Interface (API). Additionally, in collaboration with INSDC, GenBase has constructed an effective data exchange mechanism with GenBank and started sharing released nucleotide sequences. Furthermore, GenBase integrates all sequences from GenBank with daily updates, demonstrating its commitment to actively contributing to global sequence data management and sharing.
Funding: Supported by Technological Innovation 2030 (2022ZD0401701), the National Natural Science Foundation of China (32000475, 32030021), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA24040201), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021038).
Funding: Supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA19090116 and XDA19050302), the National Natural Science Foundation of China (Grant Nos. 31871328 and 32030021), the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07), the Youth Innovation Promotion Association of Chinese Academy of Sciences (Grant No. 2019104), and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008).
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences [XDB38030201, XDB38030400, XDB38050300] and the Youth Innovation Promotion Association of Chinese Academy of Sciences [2019104].
Funding: Supported by grants from the National Key R&D Program of China (Grant Nos. 2023YFC3041500 and 2021YFF0703703), the Key Collaborative Research Program of the Alliance of International Science Organizations (Grant No. ANSO-CR-KP-2022-09), the National Natural Science Foundation of China (Grant No. 32270718), the Beijing Nova Program (Grant No. Z211100002121006), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant Nos. Y2021038 and 2019104), China.
Funding: Funded by the "Strategic Priority Research Program" of CAS (Grant No. XDB38030200) and by the Open Biodiversity and Health Big Data Programme of International Union of Biological Sciences awarded to YB.
Funding: Supported by grants from the National Key R&D Program of China (Grant No. 2017YFC0907502 to ZZ); the Strategic Priority Research Program of Chinese Academy of Sciences (Grant Nos. XDB38060100 and XDB38030200 to YB; XDB38050300 to WZ; XDB38030400 to JX; XDA19050302 to ZZ); the National Key R&D Program of China (Grant Nos. 2016YFC0901603 to WZ; 2017YFC1201202 to YW; 2020YFC0847000 and 2018YFD1000505 to WZ; 2016YFE0206600 to YB); the 13th Five-year Informatization Plan of Chinese Academy of Sciences (Grant No. XXH13505-05 to YB); the Genomics Data Center Construction of Chinese Academy of Sciences (Grant No. XXH-13514-0202 to YB); the Open Biodiversity and Health Big Data Programme of the International Union of Biological Sciences to YB; the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07 to YB); the National Natural Science Foundation of China (Grant Nos. 32030021 and 31871328 to ZZ); and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008 to ZZ).
Funding: Supported by the Strategic Priority Research Program of Chinese Academy of Sciences (Grant Nos. XDB38060100 and XDB38030200 to YB; XDB38050300 to WZ; XDB38030400 to JX; XDA19050302 to ZZ); the National Key R&D Program of China (Grant Nos. 2016YFE0206600 to YB; 2020YFC0847000, 2018YFD1000505, 2017YFC1201202, and 2016YFC0901603 to WZ; 2017YFC0907502 to ZZ); the 13th Five-year Informatization Plan of Chinese Academy of Sciences (Grant No. XXH13505-05 to YB); the Genomics Data Center Construction of Chinese Academy of Sciences (Grant No. XXH-13514-0202 to YB); the Open Biodiversity and Health Big Data Programme of International Union of Biological Sciences to YB; the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07 to YB); the National Natural Science Foundation of China (Grant Nos. 32030021 and 31871328 to ZZ); and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008 to ZZ).
Abstract: The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata of biological project, biological sample, and genome assembly, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By May 21, 2021, GWH had received 19,124 direct submissions covering 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
Funding: This work was supported by grants from the Strategic Priority Research Program of Chinese Academy of Sciences (Grant Nos. XDA19090116, XDA19050302, and XDB38030400) awarded to SS, ZZ, and ML; the National Key R&D Program of China (Grant Nos. 2020YFC0848900, 2020YFC0847000, 2016YFE0206600, and 2017YFC0907502); the 13th Five-year Informatization Plan of Chinese Academy of Sciences (Grant No. XXH13505-05); Genomics Data Center Construction of Chinese Academy of Sciences (Grant No. XXH-13514-0202); the Open Biodiversity and Health Big Data Programme of International Union of Biological Sciences, International Partnership Program of Chinese Academy of Sciences (Grant No. 153F11KYSB20160008); and the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07). This work was also supported by the K. C. Wong Education Foundation to ZZ and the Youth Innovation Promotion Association of Chinese Academy of Sciences (Grant Nos. 2017141 and 2019104) awarded to SS and ML.
Abstract: On January 22, 2020, the China National Center for Bioinformation (CNCB) released the 2019 Novel Coronavirus Resource (2019nCoVR), an open-access information resource for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 2019nCoVR features a comprehensive integration of sequence and clinical information for all publicly available SARS-CoV-2 isolates, which are manually curated with value-added annotations and quality evaluated by an automated in-house pipeline. Of particular note, 2019nCoVR offers systematic analyses to generate a dynamic landscape of SARS-CoV-2 genomic variations at a global scale. It provides all identified variants and their detailed statistics for each virus isolate, and aggregates the quality score, functional annotation, and population frequency for each variant. The spatiotemporal change of each variant can be visualized, and historical viral haplotype network maps for the course of the outbreak are also generated based on all complete and high-quality genomes available. Moreover, 2019nCoVR provides a full collection of SARS-CoV-2 relevant literature on the coronavirus disease 2019 (COVID-19), including published papers from PubMed as well as preprints from services such as bioRxiv and medRxiv through Europe PMC. Furthermore, by linking with relevant databases in CNCB, 2019nCoVR offers data submission services for raw sequence reads and assembled genomes, and data sharing with NCBI. Collectively, 2019nCoVR is updated daily to collect the latest information on genome sequences, variants, haplotypes, and literature for a timely reflection, making it a valuable resource for the global research community. 2019nCoVR is accessible at https://bigd.big.ac.cn/ncov/.
Funding: This study was supported by the National Key R&D Program of China (Grant No. 2020YFC0847000) and the National Natural Science Foundation of China (Grant Nos. 31571370, 91731302, and 31772435).
Abstract: A novel RNA virus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is responsible for the ongoing outbreak of coronavirus disease 2019 (COVID-19). Population genetic analysis could be useful for investigating the origin and evolutionary dynamics of COVID-19. However, due to extensive sampling bias and the existence of infection clusters during the epidemic spread, direct applications of existing approaches can lead to biased parameter estimations and data misinterpretation. In this study, we first present a robust estimator for the time to the most recent common ancestor (TMRCA) and the mutation rate, and then apply the approach to analyze 12,909 genomic sequences of SARS-CoV-2. The mutation rate is inferred to be 8.69 × 10^(−4) per site per year with a 95% confidence interval (CI) of [8.61 × 10^(−4), 8.77 × 10^(−4)], and the TMRCA of the samples is inferred to be Nov 28, 2019 with a 95% CI of [Oct 20, 2019, Dec 9, 2019]. The results indicate that COVID-19 might have originated earlier than, and outside of, the Wuhan Seafood Market. We further demonstrate that genetic polymorphism patterns, including the enrichment of specific haplotypes and the temporal allele frequency trajectories generated from infection clusters, are similar to those caused by evolutionary forces such as natural selection. Our results show that population genetic methods need to be developed to efficiently disentangle the effects of sampling bias and infection clusters to gain insights into the evolutionary mechanism of SARS-CoV-2. Software for implementing VirusMuT can be downloaded at https://bigd.big.ac.cn/biocode/tools/BT007081.
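To put the per-site mutation rate reported in the abstract above on a more intuitive scale, it can be converted to an expected number of substitutions per genome per year. The following is only a minimal arithmetic sketch, not part of VirusMuT: the genome length of 29,903 nt is an assumption taken from the Wuhan-Hu-1 reference sequence (NC_045512.2) and does not appear in the abstract, and the function name is hypothetical.

```python
# Point estimate and 95% CI from the abstract (substitutions per site per year).
RATE = 8.69e-4
RATE_CI = (8.61e-4, 8.77e-4)

# Assumption, not stated in the abstract: length in nucleotides of the
# Wuhan-Hu-1 reference genome (NC_045512.2).
GENOME_LENGTH = 29_903


def per_genome_rate(rate_per_site_per_year, genome_length=GENOME_LENGTH):
    """Scale a per-site yearly rate to a whole-genome yearly rate."""
    return rate_per_site_per_year * genome_length


point = per_genome_rate(RATE)
low, high = (per_genome_rate(r) for r in RATE_CI)

print(f"{point:.1f} substitutions per genome per year "
      f"(95% CI {low:.1f} to {high:.1f})")
```

Under this assumption the estimate corresponds to roughly 26 substitutions per genome per year, or about two per month.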
Funding: Supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA19090116 and XDA19050302); the National Key R&D Program of China (Grant No. 2017YFC0907502); the 13th Five-year Informatization Plan of the Chinese Academy of Sciences (Grant No. XXH13505-05); the K. C. Wong Education Foundation to ZZ; and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008).
Abstract: Data and their tailored characteristics are inheritable and long-lived, outlasting the results and conclusions drawn from them, regardless of whether those are produced by the data generators or by data users. Aside from designing experiments to acquire new data, scientific researchers always begin with a thorough synthesis of the existing data, especially those that have been demonstrated to be authentic and timely.
Funding: Funded by the National Natural Science Foundation of China (Grant No. 82161148009); the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDB38030400); the Capital Health Development and Research Special Programme (Grant No. 20211G-3012); the Conselho Nacional de Desenvolvimento Cientifico e Tecnológico (CNPq)-NGS-BRICS-n°: 440931/2020-7; and the Russian Foundation for Basic Research (RFBR) (Grant No. 20-54-80014).
Abstract: A new variant of concern for SARS-CoV-2, Omicron (B.1.1.529), was designated by the World Health Organization on November 26, 2021. This study analyzed the viral genome sequencing data of 108 samples collected from patients infected with Omicron. First, we found that the enrichment efficiency of viral nucleic acids was reduced due to mutations in the regions to which the primers anneal. Second, the Omicron variant possesses an excessive number of mutations compared to other variants circulating at the same time (median: 62 vs. 45), especially in the Spike gene. Mutations in the Spike gene confer alterations in 32 amino acid residues, more than those observed in other SARS-CoV-2 variants. Moreover, a large number of nonsynonymous mutations occur in the codons for the amino acid residues located on the surface of the Spike protein, which could potentially affect the replication, infectivity, and antigenicity of SARS-CoV-2. Third, there are 53 mutations between the Omicron variant and its closest sequences available in public databases. Many of these mutations were rarely observed in public databases and had a low mutation rate. In addition, the linkage disequilibrium between these mutations was low, with a limited number of mutations concurrently observed in the same genome, suggesting that the Omicron variant would be in a different evolutionary branch from the currently prevalent variants. To improve our ability to detect and track the source of new variants rapidly, it is imperative to further strengthen genomic surveillance and data sharing globally in a timely manner.
Funding: Supported by grants from the National Key R&D Program of China (Grant Nos. 2021YFC0863300, 2020YFC0848900, and 2016YFE0206600); the National Natural Science Foundation of China (Grant No. 82161148009); the Strategic Priority Research Program of Chinese Academy of Sciences, China (Grant Nos. XDA19090116 and XDB38060100); the Open Biodiversity and Health Big Data Programme of International Union of Biological Sciences, International Partnership Program of Chinese Academy of Sciences (Grant No. 153F11KYSB20160008); the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07); and the Youth Innovation Promotion Association of Chinese Academy of Sciences (Grant No. 2017141).
Abstract: COVID-19 has swept the globe, and Pakistan is no exception. To investigate the initial introductions and transmissions of SARS-CoV-2 in Pakistan, we performed the largest genomic epidemiology study of COVID-19 in Pakistan and generated 150 complete SARS-CoV-2 genome sequences from samples collected from March 16 to June 1, 2020. We identified a total of 347 mutated positions, 31 of which were over-represented in Pakistan. Meanwhile, we found over 1000 intra-host single-nucleotide variants (iSNVs). Several of them occurred concurrently, indicating possible interactions among them or coevolution. Some of the high-frequency iSNVs in Pakistan were not observed in the global population, suggesting strong purifying selection. The genomic epidemiology revealed five distinctive spreading clusters. The largest cluster consisted of 74 viruses, which were derived from different geographic locations of Pakistan and formed a deep hierarchical structure, indicating an extensive and persistent nation-wide transmission of the virus that was probably attributable to a signature mutation (G8371T in ORF1ab) of this cluster. Furthermore, 28 putative international introductions were identified, several of which are consistent with the epidemiological investigations. In all, this study has inferred the possible pathways of introductions and transmissions of SARS-CoV-2 in Pakistan, which could aid ongoing and future viral surveillance and COVID-19 control.
Funding: Funded by a grant from the Youth Innovation Promotion Association of CAS (Y2021038 to S.S.).
Abstract: Monkeypox is a viral zoonotic disease endemic in Central and West Africa. Since January 1, 2022, 3413 laboratory-confirmed monkeypox cases and one death have been reported from 50 countries/territories in five WHO regions (as of June 22, 2022; https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON396), and 1310 new cases and eight new countries have been reported in the past week. Genomic epidemiology is vital to determine the similarity between viruses and, when combined with epidemiological information, to suggest possible links between cases, origins of infection, and transmission dynamics. However, one of the priority evidence gaps relating to the monkeypox outbreak is genome sequencing and in-host variation analysis [1]. Timely sharing of both raw sequence data and consensus genomic data is therefore useful to public health investigators and academic partners undertaking related studies.
Funding: Supported by the National Key R&D Program of China (Grant Nos. 2021YFC0863400, 2021YFC2301305, 2020YFC0847000, 2018YFC1406902, and 2018YFC0910402); the Key Program of Chinese Academy of Sciences (Grant No. KJZD-SW-L14); the National Natural Science Foundation of China (Grant Nos. 31571370, 91731302, and 91631106); the Shanghai Municipal Science and Technology Major Project, China (Grant No. 2017SHZDZX01); and the Strategic Priority Research Program of the Chinese Academy of Sciences, China (Grant Nos. XDPB17 and XDB38040200).
Abstract: SARS-CoV-2 is a new RNA virus affecting humans and has spread extensively throughout the world since its first outbreak in December 2019. Whether the transmissibility and pathogenicity of SARS-CoV-2 in humans after zoonotic transfer are actively evolving, and driven by adaptation to the new host and environments, is still under debate. Understanding the evolutionary mechanism underlying the epidemiological and pathological characteristics of COVID-19 is essential for predicting the epidemic trend and providing guidance for disease control and treatments. Interrogating novel strategies for identifying natural selection using within-species polymorphisms and 3,674,076 SARS-CoV-2 genome sequences from 169 countries as of December 30, 2021, we demonstrate with population genetic evidence that during the course of the SARS-CoV-2 pandemic in humans, 1) SARS-CoV-2 genomes are overall conserved under purifying selection, especially for the 14 genes related to viral RNA replication, transcription, and assembly; 2) ongoing positive selection is actively driving the evolution of 6 genes (e.g., S, ORF3a, and N) that play critical roles in molecular processes involving pathogen–host interactions, including viral invasion into and egress from host cells, and viral inhibition and evasion of the host immune response, possibly leading to high transmissibility and mild symptoms in SARS-CoV-2 evolution. According to an established haplotype phylogenetic relationship of 138 viral clusters, a spatial and temporal landscape of 556 critical mutations is constructed based on their divergence among viral haplotype clusters or their repeated increase in frequency within at least 2 clusters. Among these, multiple mutations potentially conferring alterations in the transmissibility, pathogenicity, and virulence of SARS-CoV-2 are highlighted, warranting attention.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB38030200); the National Key R&D Program of China (Grant No. 2021YFF0703701); the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2023-07); the International Partnership Program of the Chinese Academy of Sciences (Grant No. 161GJHZ2022002MI); and the Open Biodiversity and Health Big Data Initiative of International Union of Biological Sciences (IUBS).
Abstract: The rapid advancement of sequencing technologies poses challenges in managing the large volume and exponential growth of sequence data efficiently and on time. To address this issue, we present GenBase (https://ngdc.cncb.ac.cn/genbase), an open-access data repository that follows the International Nucleotide Sequence Database Collaboration (INSDC) data standards and structures, for efficient nucleotide sequence archiving, searching, and sharing. As a core resource within the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GenBase offers a bilingual submission pipeline and services, as well as local submission assistance in China. GenBase also provides a unique Excel format for metadata description and feature annotation of nucleotide sequences, along with a real-time data validation system to streamline sequence submissions. As of April 23, 2024, GenBase had received 68,251 nucleotide sequences and 689,574 annotated protein sequences across 414 species from 2319 submissions. Of these, 63,614 (93%) nucleotide sequences and 620,640 (90%) annotated protein sequences have been released and are publicly accessible through GenBase's web search system, File Transfer Protocol (FTP), and Application Programming Interface (API). Additionally, in collaboration with INSDC, GenBase has constructed an effective data exchange mechanism with GenBank and started sharing released nucleotide sequences. Furthermore, GenBase integrates all sequences from GenBank with daily updates, demonstrating its commitment to actively contributing to global sequence data management and sharing.
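The release percentages quoted in the GenBase abstract above follow directly from the reported counts. The short sketch below recomputes them as a consistency check; it uses only the figures given in the abstract, and the helper name `release_percent` is hypothetical rather than anything provided by GenBase.

```python
def release_percent(released, received):
    """Return the share of received records that were released, in percent."""
    if received <= 0:
        raise ValueError("received must be positive")
    return 100.0 * released / received


# Counts as of April 23, 2024, taken from the abstract.
nucleotide = release_percent(63_614, 68_251)
protein = release_percent(620_640, 689_574)

print(f"nucleotide sequences released: {nucleotide:.1f}%")
print(f"annotated protein sequences released: {protein:.1f}%")
```

Both values round to the whole-number percentages stated in the abstract (93% and 90%).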