The Chinese tree shrew (Tupaia belangeri chinensis) is emerging as an important experimental animal in multiple fields of biomedical research. Comprehensive reference genome annotation for both mRNA and long non-coding RNA (lncRNA) is crucial for developing animal models using this species. In the current study, we collected a total of 234 high-quality RNA sequencing (RNA-seq) datasets and two long-read isoform sequencing (ISO-seq) datasets and improved the annotation of our previously assembled high-quality chromosome-level tree shrew genome. We obtained a total of 3514 newly annotated coding genes and 50576 lncRNA genes. We also characterized the tissue-specific expression patterns and alternative splicing patterns of mRNAs and lncRNAs and mapped the orthologous relationships among 11 mammalian species using the current annotated genome. We identified 144 tree shrew-specific gene families, including interleukin 6 (IL6) and STT3 oligosaccharyltransferase complex catalytic subunit B (STT3B), which underwent significant changes in size. Comparison of the overall expression patterns in tissues and pathways across four species (human, rhesus monkey, tree shrew, and mouse) indicated that tree shrews are more similar to primates than to mice at the tissue-transcriptome level. Notably, the newly annotated purine rich element binding protein A (PURA) gene and the STT3B gene family showed dysregulation upon viral infection. The updated version of the tree shrew genome annotation (KIZ version 3: TS_3.0) is available at http://www.treeshrewdb.org and provides an essential reference for basic and biomedical studies using tree shrew animal models.
Diatoms comprise a diverse and ecologically important group of eukaryotic phytoplankton that significantly contributes to marine primary production and global carbon cycling. Phaeodactylum tricornutum is commonly used as a model organism for studying diatom biology. Although its genome was sequenced in 2008, a high-quality genome annotation is still not available for this diatom. Here we report the development of an integrated proteogenomic pipeline and its application for improved annotation of the P. tricornutum genome using mass spectrometry (MS)-based proteomics data. Our proteogenomic analysis unambiguously identified approximately 8300 genes and revealed 606 novel proteins, 506 revised genes, 94 splice variants, 58 single amino acid variants, and a holistic view of post-translational modifications in P. tricornutum. We experimentally confirmed a subset of novel events and obtained MS evidence for more than 200 micropeptides in P. tricornutum. These findings expand the genomic landscape of P. tricornutum and provide a rich resource for the study of diatom biology. The proteogenomic pipeline we developed in this study is applicable to any sequenced eukaryote and thus represents a significant contribution to the toolset for eukaryotic proteogenomic analysis. The pipeline and its source code are freely available at https://sourceforge.net/projects/gapeproteogenomic.
The kuruma prawn, Marsupenaeus japonicus, is one of the most cultivated and consumed species of shrimp. However, very few molecular genetic or genomic resources are publicly available for it. Thus, the characterization and distribution of simple sequence repeats (SSRs) remain ambiguous, and the use of SSR markers in genomic studies and marker-assisted selection is limited. The goal of this study is to characterize and develop genome-wide SSR markers in M. japonicus by genome survey sequencing for application in comparative genomics and breeding. A total of 326 945 perfect SSRs were identified, among which dinucleotide repeats were the most frequent class (44.08%), followed by mononucleotides (29.67%), trinucleotides (18.96%), tetranucleotides (5.66%), hexanucleotides (1.07%), and pentanucleotides (0.56%). In total, primers were successfully designed for 151 541 SSR loci. A subset of 30 SSR primer pairs was synthesized and tested in 42 individuals from a wild population, of which 27 loci (90.0%) were successfully amplified with specific products and 24 (80.0%) were polymorphic. For the amplified polymorphic loci, the number of alleles ranged from 5 to 17 (with an average of 9.63), and the average PIC value was 0.796. A total of 58 256 SSR-containing sequences had significant Gene Ontology annotation; these are good functional molecular marker candidates for association studies and comparative genomic analysis. The newly identified SSRs significantly contribute to the M. japonicus genomic resources and will facilitate a number of genetic and genomic studies, including high-density linkage mapping, genome-wide association analysis, marker-aided selection, comparative genomics analysis, population genetics, and evolution.
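To make the notion of a "perfect SSR" concrete, the following is a minimal Python sketch of how uninterrupted tandem repeats of 1 to 6 bp motifs can be located in a sequence. The function name `find_perfect_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for illustration; the abstract does not state which tool or thresholds the study used.

```python
import re

# Minimum number of tandem copies required per motif length.
# These thresholds are illustrative assumptions, not the study's settings.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n - 1 further exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA"),
            # so each locus is reported under its shortest motif only.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence: an (AC)8 dinucleotide repeat and an (A)12 mononucleotide run.
example = "TTG" + "AC" * 8 + "GGT" + "A" * 12 + "C"
ssrs = find_perfect_ssrs(example)
```

Counting the hits by motif length would give class frequencies of the kind reported above (dinucleotide, mononucleotide, and so on).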
Pear is an important fruit tree that is widely distributed around the world. The first pear genome map was reported from our laboratory approximately 10 years ago. To further study global protein expression patterns in pear, we generated pear proteome data based on 24 major tissues. The tissue-resolved profiles provided evidence of the expression of 17953 proteins. We identified 4294 new coding events and improved the pear genome annotation via the proteogenomic strategy based on 18090 peptide spectra with peptide spectrum matches > 1. Among the eight randomly selected new short coding open reading frames that were expressed in the style, four promoted and one inhibited the growth of pear pollen tubes. Based on gene coexpression module analysis, we explored the key genes associated with important agronomic traits, such as stone cell formation in fruits. The network regulating the synthesis of lignin, a major component of stone cells, was reconstructed, and receptor-like kinases were implicated as core factors in this regulatory network. Moreover, we constructed the online database PearEXP (http://www.peardb.org.cn) to enable access to the pear proteogenomic resources. This study provides a paradigm for in-depth proteogenomic studies of woody plants.
The miiuy croaker, Miichthys miiuy, is an ecologically important teleost species that is widely distributed along the southeast coast of China. In this study, we present a chromosome-scale genome assembly of the miiuy croaker, an important sciaenid fish of economic value. We adopted Oxford Nanopore and Hi-C sequencing techniques to achieve an assembly with high accuracy and completeness. Investigation of the genome characteristics and functional features may provide insights into the phylogenetic diversity of the miiuy croaker. This study may also help to improve molecular-assisted breeding techniques. Moreover, it can serve as a valuable resource for further studies of other sciaenids.
The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences released every week. Among such projects, the plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNA-seq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT at no cost.
The mechanism of calcium uptake, translocation and accumulation in Poaceae has not yet been fully understood. To address this issue, we conducted a genome-wide comparative in silico analysis of the calcium (Ca2+) transporter gene family of two crop species, rice and sorghum. Gene annotation, identification of upstream cis-acting elements, phylogenetic tree construction and syntenic mapping of the gene family were performed using several bioinformatics tools. A total of 31 Ca2+ transporters, distributed on 9 out of 12 chromosomes, were predicted from the rice genome, while the 28 Ca2+ transporters predicted from sorghum are distributed on all the chromosomes except chromosome 10 (Chr 10). Interestingly, most of the genes on Chr 1 and Chr 3 show an inverse syntenic relationship between rice and sorghum. Multiple sequence alignment and motif analysis of these transporter proteins revealed high conservation between the two species. The phylogenetic tree clearly identified the subclasses of channels, ATPases and exchangers within the gene family. The in silico cis-regulatory element analysis suggested diverse functions associated with light, stress and hormone responsiveness as well as endosperm- and meristem-specific gene expression. Further experiments are warranted to validate the in silico analysis of the predicted transporter gene family and elucidate the functions of Ca2+ transporters in various biological processes.
The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata of the biological project, biological sample, and genome assembly, in addition to the genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 direct submissions covering 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
The blackspotted croaker (Protonibea diacanthus) is an endangered coastal marine fish. It is also a valuable species that is cultured on the southeast coast of China. While some genetic studies have been conducted to protect this species, genomic resources are lacking. Here, we report a chromosome-scale assembly of the P. diacanthus genome by high-depth genome sequencing, assembly, and annotation. The genome size was 635.69 Mb, with contig and scaffold N50 lengths of 3.33 Mb and 25.60 Mb, respectively. Hi-C scaffolding of the genome resulted in 24 chromosomes covering 94.15% of the total genome. We predicted 23,971 protein-coding genes. In addition, we constructed a phylogenetic tree using 2755 single-copy gene families and identified 462 unique gene families in the P. diacanthus genome compared to three other sciaenids. Moreover, from the analysis of gene families, we found that several gene families related to innate immunity were significantly expanded in the blackspotted croaker genome compared to other teleost genomes. The high-quality genome can improve our understanding of the molecular mechanisms behind economically valuable traits and provide insights into characteristics of the immune system.
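For readers unfamiliar with the N50 statistic quoted above, here is a short Python sketch of its standard definition: the length of the contig (or scaffold) at which the cumulative length of all sequences, sorted from longest to shortest, first reaches half of the total assembly size. The function name `n50` and the toy lengths are assumptions for illustration, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the cumulative length to at
        # least half of the assembly defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths (arbitrary units), not taken from the assembly.
toy_n50 = n50([80, 70, 50, 40, 30, 20])
```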
Annotation of the genome sequence of SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable for understanding its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences using annotation programs that are publicly available or were developed by ourselves. In total, 21 open reading frames (ORFs) of genes or putative uncharacterized proteins (PUPs) were predicted. Seven PUPs had not been reported previously, and two of them were predicted to contain transmembrane regions. Eight ORFs partially overlapped with or were embedded in those of known genes, revealing that the SARS-CoV genome is small and compact, with overlapping coding regions. The most striking discovery is that one ORF is located on the minus strand. We have also annotated non-coding regions and identified the transcription regulating sequences (TRS) in the intergenic regions. The analysis of TRS supports the minus strand extending transcription mechanism of coronavirus. The SNP analysis of different isolates reveals that mutations of the sequences do not affect the prediction results of ORFs.
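As a rough illustration of what scanning both strands for ORFs involves, the sketch below searches all six reading frames of a DNA-alphabet sequence for ATG-to-stop stretches. It is a generic textbook scan under stated assumptions (the function names, the `min_codons` threshold, and the toy sequences are all hypothetical) and is not the annotation software used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end) for ATG..stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and for the minus strand they
    refer to the reverse-complemented sequence. Length is counted in
    codons and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs


plus_hits = find_orfs("CCATGAAATAGCC")   # ORF on the given strand
minus_hits = find_orfs("CTATTTCAT")      # ORF only on the opposite strand
```

The second call shows the kind of case highlighted in the abstract: the given strand contains no ORF, while its reverse complement does.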
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs based on common machine learning methods have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because of the absence of a training set of abundant validated genes. Here we present a new gene-finding program, SCGPred, to improve the accuracy of prediction by combining multiple sources of evidence. SCGPred can run as a supervised method in previously well-studied genomes and as an unsupervised one in novel genomes. When tested on datasets composed of large DNA sequences from human and the novel genome of Ustilago maydis, SCGPred showed a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.
The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of the organism or the ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functional units often corresponding to subpathways in the KEGG pathway maps, are designed for better automation of genome interpretation. Each KEGG module is represented by a simple Boolean expression of KEGG Orthology (KO) identifiers (K numbers), enabling automatic evaluation of the completeness of genes in the genome. Here we focus on metabolic functions and introduce reaction modules for improving annotation and signature modules for inferring metabolic capacity. We also describe how genome annotation is performed in KEGG using the manually created KO database and the computationally generated SSDB database. The resulting KEGG GENES database with KO (K number) annotation is a reference sequence database to be compared for automated annotation and interpretation of newly determined genomes.
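The idea of evaluating a module as a Boolean expression of K numbers can be sketched in Python as follows. This is a simplified evaluator under explicit assumptions: a comma means OR (lowest precedence), whitespace means AND, `+` joins subunits of a complex (also AND, binding tightest), and parentheses group. KEGG's own notation has further features, such as `-` for optional components, that are not handled here. The function name `module_complete` and the K numbers in the example are placeholders, not an actual module definition.

```python
import re


def module_complete(definition, present):
    """Evaluate a simplified module definition against a set of K numbers."""
    text = re.sub(r"\s+", " ", definition.strip())
    tokens = re.findall(r"K\d{5}|[(),+ ]", text)
    if "".join(tokens) != text:
        raise ValueError("definition contains unsupported characters")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"unexpected token {tok!r}")
        pos += 1
        return tok

    def chain(operand, separator, combine):
        # Parse operand (separator operand)* and fold with combine.
        value = operand()
        while peek() == separator:
            take(separator)
            rhs = operand()
            value = combine(value, rhs)
        return value

    def atom():
        if peek() == "(":
            take("(")
            value = alternatives()
            take(")")
            return value
        tok = take()
        if not tok.startswith("K"):
            raise ValueError(f"expected a K number, got {tok!r}")
        return tok in present

    def complex_():      # K1+K2: all subunits required
        return chain(atom, "+", lambda a, b: a and b)

    def steps():         # step1 step2: all steps required
        return chain(complex_, " ", lambda a, b: a and b)

    def alternatives():  # a,b: any alternative suffices
        return chain(steps, ",", lambda a, b: a or b)

    result = alternatives()
    if peek() is not None:
        raise ValueError("unbalanced or trailing tokens")
    return result


# Placeholder definition: step 1 has two alternatives, step 2 is a complex.
definition = "(K00001,K00002) K00003+K00004"
complete = module_complete(definition, {"K00002", "K00003", "K00004"})
incomplete = module_complete(definition, {"K00001", "K00003"})
```

Given the set of K numbers assigned to a genome, a check like this reports whether every step of a module has at least one alternative present, which is the sense of "completeness" described above.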
Funding (Chinese tree shrew genome annotation study): This study was supported by the National Natural Science Foundation of China (U1902215 to Y.G.Y. and 31970542 to Y.F.), the Chinese Academy of Sciences (Light of West China Program xbzg-zdsys-201909 to Y.G.Y.), and Yunnan Province (202001AS070023 and 2018FB046 to D.D.Y. and 202002AA100007 to Y.G.Y.).
Funding (Phaeodactylum tricornutum proteogenomics study): This work was supported by the National Key Research and Development Program (2016YFA0501304), the National Natural Science Foundation of China (grant no. 31570829), and the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDB14030202).
Funding (Marsupenaeus japonicus SSR study): Supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA10A409).
Funding (pear proteogenomics study): This work was funded by the National Key Research and Development Program of China (2022YFF1003100-02, 2020YFE0202900), the National Natural Science Foundation of China (32172543, 31830081, 22274130, 32202411), the Fundamental Research Funds for the Central Universities (JCQY201901, KYZ201888), the Jiangsu Agriculture Science and Technology Innovation Fund (CX(19)2028), the seed industry promotion project of Jiangsu (JBGS(2021)022), the guidance foundation of Hainan Institute of Nanjing Agricultural University (NAUSY-MS08), the Earmarked Fund for China Agriculture Research System (CARS-28), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Funding (miiuy croaker genome study): This work was supported by the National Key Research and Development Project (2018YFD0900301) and the National Natural Science Foundation of China (31802325).
Funding (RGAAT study): This work was supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA08020102), the National Natural Science Foundation of China (Grant Nos. 81701071, 31501042, 31271385, and 31200957), the Shenzhen Science and Technology Program (Grant No. JCYJ20170306171013613), China, and King Abdulaziz City for Science and Technology (KACST; Grant No. 1035-35), Kingdom of Saudi Arabia.
Funding (rice and sorghum Ca2+ transporter study): This work was supported by the Department of Biotechnology, Govt. of India, as Programme Support for research and development in Agricultural Biotechnology at G.B. Pant University of Agriculture and Technology, Pantnagar (Grant No. BT/PR7849/AGR/02/2006).
Funding (Genome Warehouse study): This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (Grant Nos. XDB38060100 and XDB38030200 to YB, XDB38050300 to WZ, XDB38030400 to JX, and XDA19050302 to ZZ), the National Key R&D Program of China (Grant Nos. 2016YFE0206600 to YB; 2020YFC0847000, 2018YFD1000505, 2017YFC1201202, and 2016YFC0901603 to WZ; and 2017YFC0907502 to ZZ), the 13th Five-year Informatization Plan of Chinese Academy of Sciences (Grant No. XXH13505-05 to YB), the Genomics Data Center Construction of Chinese Academy of Sciences (Grant No. XXH-13514-0202 to YB), the Open Biodiversity and Health Big Data Programme of International Union of Biological Sciences to YB, the Professional Association of the Alliance of International Science Organizations (Grant No. ANSO-PA-2020-07 to YB), the National Natural Science Foundation of China (Grant Nos. 32030021 and 31871328 to ZZ), and the International Partnership Program of the Chinese Academy of Sciences (Grant No. 153F11KYSB20160008 to ZZ).
Funding: The National Key Research and Development Project (2018YFD0900301) and the National Natural Science Foundation of China (31802325).
Abstract: The blackspotted croaker (Protonibea diacanthus) is an endangered coastal marine fish. It is also a valuable species that is cultured on the southeast coast of China. While some genetic studies have been conducted to protect this species, genomic resources are lacking. Here, we report a chromosome-scale assembly of the P. diacanthus genome produced by high-depth genome sequencing, assembly, and annotation. The assembled genome was 635.69 Mb in size, with contig and scaffold N50 lengths of 3.33 Mb and 25.60 Mb, respectively. Hi-C scaffolding anchored 94.15% of the total genome onto 24 chromosomes. We predicted 23,971 protein-coding genes. In addition, we constructed a phylogenetic tree using 2755 single-copy gene families and identified 462 unique gene families in the P. diacanthus genome compared to three other sciaenids. Moreover, analysis of gene families showed that several families related to innate immunity were significantly expanded in the blackspotted croaker genome compared to other teleost genomes. This high-quality genome can improve our understanding of the molecular mechanisms behind economically valuable traits and provide insights into the characteristics of the immune system.
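The contig and scaffold N50 values quoted in this abstract follow the usual assembly-statistics definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly. The following is a minimal Python sketch of that definition only. The function name `n50` and the toy lengths are illustrative assumptions, not the authors' pipeline or data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical values, arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies unchanged to scaffold lengths; only the input list differs.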
Abstract: Annotation of the genome sequence of SARS-CoV (severe acute respiratory syndrome-associated coronavirus) is indispensable for understanding its evolution and pathogenesis. We have performed a full annotation of the SARS-CoV genome sequences using annotation programs that are publicly available or that we developed ourselves. In total, 21 open reading frames (ORFs) of genes or putative uncharacterized proteins (PUPs) were predicted. Seven PUPs had not been reported previously, and two of them were predicted to contain transmembrane regions. Eight ORFs partially overlapped with, or were embedded in, those of known genes, revealing that the SARS-CoV genome is small and compact, with overlapping coding regions. The most striking discovery is that one ORF is located on the minus strand. We have also annotated non-coding regions and identified the transcription regulating sequences (TRS) in the intergenic regions. The analysis of TRS supports the minus-strand extending transcription mechanism of coronavirus. SNP analysis of different isolates reveals that sequence mutations do not affect the ORF prediction results.
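To illustrate what it means for an ORF to lie on the minus strand, here is a naive six-frame ORF scan in Python. It is a sketch under stated assumptions: it looks only for ATG-to-stop stretches using the standard stop codons, the `min_codons` threshold and the toy sequence are hypothetical, and it is not the annotation software used by the authors.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (ACGT only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs.

    All three frames of both strands are scanned. Coordinates are
    0-based, end-exclusive, and relative to the scanned strand.
    min_codons counts the start and stop codons (hypothetical default).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence whose only ORF reads ATGCCCTAA on the minus strand.
for orf in find_orfs("TTAGGGCAT"):
    print(orf)
```

Because every frame on both strands is scanned independently, overlapping or embedded ORFs such as those described in the abstract would each be reported as separate hits.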
Funding: This work was partially supported by the National Natural Science Foundation of China (No. 30470984).
Abstract: Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs, most of which use machine learning methods, have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because no training set of abundant validated genes is available. Here we present a new gene-finding program, SCGPred, which improves prediction accuracy by combining multiple sources of evidence. SCGPred can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. In tests on datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred achieved a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.
Abstract: The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of an organism or ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functional units often corresponding to subpathways in the KEGG pathway maps, are designed for better automation of genome interpretation. Each KEGG module is represented by a simple Boolean expression of KEGG Orthology (KO) identifiers (K numbers), enabling automatic evaluation of the completeness of genes in the genome. Here we focus on metabolic functions and introduce reaction modules for improving annotation and signature modules for inferring metabolic capacity. We also describe how genome annotation is performed in KEGG using the manually created KO database and the computationally generated SSDB database. The resulting KEGG GENES database with KO (K number) annotation is a reference sequence database against which newly determined genomes can be compared for automated annotation and interpretation.
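The idea of evaluating module completeness from a Boolean expression of K numbers can be sketched in a few lines of Python. This sketch makes several assumptions: the expression is stored as nested tuples rather than in KEGG's own definition notation, the module and its K numbers are placeholders chosen for illustration and do not correspond to any real KEGG module, and the function name `is_complete` is hypothetical.

```python
def is_complete(expr, present):
    """Evaluate a nested Boolean expression of K numbers.

    expr is either a K number string (leaf) or a tuple whose first
    element is "and" or "or", followed by sub-expressions.
    present is the set of K numbers annotated in a genome.
    """
    if isinstance(expr, str):
        return expr in present
    op, *operands = expr
    if op == "and":
        return all(is_complete(e, present) for e in operands)
    if op == "or":
        return any(is_complete(e, present) for e in operands)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical three-step module (placeholder K numbers):
# step 1 has two alternatives, step 2 is a single gene, and
# step 3 is either one gene or a pair of genes acting together.
toy_module = (
    "and",
    ("or", "K00001", "K00002"),
    "K00003",
    ("or", "K00004", ("and", "K00005", "K00006")),
)

genome_a = {"K00002", "K00003", "K00005", "K00006"}
genome_b = {"K00001", "K00003", "K00005"}

print(is_complete(toy_module, genome_a))  # True: every step is satisfied
print(is_complete(toy_module, genome_b))  # False: step 3 lacks K00004 and K00006
```

Keeping the expression as data rather than code means the same evaluator can be applied to any number of modules and genomes without modification.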