Although genome-wide association studies have identified more than eighty genetic variants associated with non-small cell lung cancer (NSCLC) risk, the biological mechanisms of these variants remain largely unknown. By integrating large-scale genotype data from 15581 lung adenocarcinoma (AD) cases, 8350 squamous cell carcinoma (SqCC) cases, and 27355 controls with multiple transcriptome and epigenomic databases, we conducted histology-specific meta-analyses and functional annotations of both reported and novel susceptibility variants. We identified 3064 credible risk variants for NSCLC, which were overrepresented in enhancer-like and promoter-like histone modification peaks as well as DNase I hypersensitive sites. Transcription factor enrichment analysis revealed that USF1 was AD-specific while CREB1 was SqCC-specific. Functional annotation and gene-based analysis implicated 894 target genes, including 274 specific to AD and 123 specific to SqCC, which were overrepresented in somatic driver genes (ER=1.95, P=0.005). Pathway enrichment analysis and Gene-Set Enrichment Analysis revealed that AD genes were primarily involved in immune-related pathways, while SqCC genes were related to homologous recombination deficiency. Our results illustrate the molecular basis of both well-studied and new susceptibility loci of NSCLC, providing not only novel insights into the genetic heterogeneity between AD and SqCC but also a set of plausible gene targets for post-GWAS functional experiments.
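The overrepresentation of target genes among somatic driver genes (reported as an enrichment ratio with a P value) can be illustrated with a one-sided hypergeometric test. This is a minimal sketch, not the authors' stated method: the abstract does not say which test was used, and the function name `enrichment` and all counts in the usage line are hypothetical.

```python
from math import comb

def enrichment(k, n, K, N):
    """Enrichment ratio and one-sided hypergeometric P value.

    k: selected genes that fall in the category (e.g. driver genes)
    n: number of selected genes (e.g. target genes)
    K: category genes in the whole background
    N: total genes in the background
    """
    # Observed fraction in the selection divided by the background fraction.
    ratio = (k / n) / (K / N)
    # P(X >= k) when drawing n genes without replacement from N.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return ratio, tail / comb(N, n)

# Hypothetical toy counts, chosen only to keep the arithmetic checkable.
ratio, p_value = enrichment(k=2, n=3, K=4, N=10)
print(round(ratio, 3), round(p_value, 3))
```

With real data, the background size `N` (for example, all annotated protein-coding genes) strongly affects both numbers, so it would need to be stated explicitly.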
Proteins function as integral actors in essential life processes, making protein research a fundamental domain with the potential to propel advances in pharmaceuticals and disease investigation. Within protein research, there is a pressing need to uncover protein functions and untangle their intricate mechanisms. Because experimental investigations are costly and limited in throughput, computational models offer a promising alternative for accelerating protein function annotation. In recent years, protein pre-training models have shown noteworthy progress across multiple prediction tasks, which highlights their promise for tackling the intricate downstream task of protein function prediction. In this review, we describe the historical evolution and research paradigms of computational methods for predicting protein function. We then summarize progress in protein and molecule representation as well as feature extraction techniques. Finally, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, offering a comprehensive perspective on progress within this field.
Background: Brain hypoplasia and mental retardation in Down syndrome (DS) can be attributed to a severe and selective disruption of neurogenesis. Secondary disruption of the transcriptome, as well as primary gene dosage imbalance, is responsible for the phenotype. MicroRNA (miRNA) expression is relatively abundant in brain tissue. Perturbed miRNA expression might contribute to the cellular events underlying the pathology in DS. Methods: MiRNA expression profiles in the cerebrum of Ts65Dn mice, a DS model, were examined with a real-time RT-PCR array. MiRNA target gene expression was detected by real-time quantitative PCR and Western blotting. Based on the prediction of their cerebrum-specific targets, the functions of the misregulated miRNAs were annotated by Gene Ontology (GO) enrichment analysis. Results: A total of 342 miRNAs were examined. Among them, 20 miRNAs showed decreased expression in the brains of Ts65Dn mice, and some of these belonged to the same family. Two known targets of the miR-200 family, Lfng and Zeb2, were specifically selected to compare their expression in the cerebrum of Ts65Dn mice with that of euploids. However, no significant difference was found in the mRNA or protein expression levels of these genes. By enrichment analysis of the cerebrum-specific targets of each miRNA, we found that 15 of the differential miRNAs could significantly affect target genes that were enriched in the GO biological processes related to nervous system development. Conclusion: Perturbed expression of multiple functionally cooperative miRNAs contributes to the cellular events underlying the pathogenesis of DS.
Lentil (Lens culinaris Medik.), a diploid (2n = 14) with a genome size greater than 4000 Mbp, is an important cool season food legume grown worldwide. The availability of genomic resources is limited in this crop species. The objective of this study was to develop polymorphic markers in lentil using publicly available curated expressed sequence tags (ESTs). In this study, 9513 ESTs were downloaded from the National Center for Biotechnology Information (NCBI) database to develop unigene-based simple sequence repeat (SSR) markers. The ESTs were assembled into 4053 unigenes and then analyzed to identify 374 SSRs using the MISA microsatellite identification tool. Among the 374 SSRs, 26 compound SSRs were observed. Primer pairs for these SSRs were designed using Primer3 version 1.14. To classify the functional annotation of ESTs and EST-SSRs, BLASTx searches (using an E-value of 1 × 10^-5) against the public UniProt (http://www.uniprot.org/) and NCBI (http://www.ncbi.nlh.nih.gov/) databases were performed. Further functional annotation was performed using PLAZA (version 3.0) comparative genomics, and GO annotation was summarized using the Plant GO slim category. Among the 312 synthesized primers, 219 successfully amplified Lens DNA. A diverse panel of 24 Lens genotypes was used to identify polymorphic markers. A polymorphic set of 57 markers successfully discriminated the test genotypes. This set of polymorphic markers with functional annotation data could be used as molecular tools in lentil breeding.
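The SSR identification step can be sketched as a regular-expression scan for perfect tandem repeats. This is an illustrative stand-in, not MISA itself: the minimum repeat counts in `MIN_REPEATS` are assumed values (the abstract does not give the thresholds used), compound SSRs are not detected, and the function names are hypothetical.

```python
import re

# Assumed minimum repeat counts per motif length; not taken from the abstract.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    size = len(motif)
    return not any(
        size % d == 0 and motif == motif[:d] * (size // d) for d in range(1, size)
    )

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # One captured motif of length k followed by at least n - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

unigene = "GGG" + "AT" * 7 + "CCGT" + "AAG" * 5 + "TT"
print(find_ssrs(unigene))  # [(3, 'AT', 7), (21, 'AAG', 5)]
```

The primitive-motif check keeps a dinucleotide run such as (AT)n from being counted a second time as a tetranucleotide (ATAT)n repeat.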
Life may have begun in an RNA world, which is supported by increasing evidence of the vital role that RNAs perform in biological systems. In the human genome, most genes actually do not encode proteins; they are noncoding RNA genes. The largest class of noncoding genes is known as long noncoding RNAs (lncRNAs), which are transcripts greater in length than 200 nucleotides, but with no protein-coding capacity. While some lncRNAs have been demonstrated to be key regulators of gene expression and 3D genome organization, most lncRNAs are still uncharacterized. We thus propose several data mining and machine learning approaches for the functional annotation of human lncRNAs by leveraging the vast amount of data from genetic and genomic studies. Recent results from our studies and those of other groups indicate that genomic data mining can give insights into lncRNA functions and provide valuable information for experimental studies of candidate lncRNAs associated with human disease.
The discovery of novel cancer genes is one of the main goals in cancer research. Bioinformatics methods can be used to accelerate cancer gene discovery, which may help in the understanding of cancer and the development of drug targets. In this paper, we describe a classifier to predict potential cancer genes that we have developed by integrating multiple lines of biological evidence, including protein-protein interaction network properties, and sequence and functional features. We detected 55 features that were significantly different between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods, logistic regression, support vector machines (SVMs), BayesNet and decision tree, were explored in the classifier models to distinguish cancer genes from non-cancer genes. The predictive power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for the logistic regression, SVM, BayesNet and J48 tree models was 0.834, 0.740, 0.800 and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than the models based on individual biological evidence, and that the network and functional features had stronger predictive power than the sequence features in predicting cancer genes.
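The evaluation described here, 5-fold cross-validation scored by the area under the ROC curve (AUC), can be sketched without any machine learning library. This is a minimal illustration under stated assumptions: the fold assignment is a simple interleaved split (the paper's splitting scheme is not given), the labels and scores in the usage lines are made up, and both function names are hypothetical.

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for an interleaved k-fold split."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = cancer gene) and classifier scores for four genes.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
for train, test in kfold_indices(10, k=5):
    print(test)
```

In practice a classifier would be fitted on each `train` split and scored on the matching `test` split, with the per-fold AUC values averaged.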
Objective: Liver metastasis, which contributes substantially to high mortality, is the most common mode of recurrence of colon carcinoma. Thus, it is necessary to identify genes implicated in metastatic colonization of the liver in colon carcinoma. Methods: We compared mRNA profiles in 18 normal colon mucosa (N), 20 primary tumor (T) and 19 liver metastasis (M) samples from the datasets GSE49355 and GSE62321 of the Gene Expression Omnibus (GEO) database. Gene ontology (GO) terms and pathways of the identified genes were analyzed. Co-expression and protein-protein interaction (PPI) networks were used to identify interaction relationships. Survival analyses based on The Cancer Genome Atlas (TCGA) database were used for further screening. The candidate genes were then validated with our own data. Results: We identified 22 specific genes related to liver metastasis, and they were strongly associated with cell migration, adhesion, proliferation and immune response. The results also showed that C-X-C motif chemokine ligand 14 (CXCL14) might be a favorable predictive factor for survival of patients with colon carcinoma. Importantly, our validation data further suggested that lower CXCL14 expression indicated poorer outcome and contributed to metastasis. Gene set enrichment analysis (GSEA) showed that CXCL14 was negatively related to the regulation of stem cell proliferation and epithelial to mesenchymal transition (EMT). Conclusions: CXCL14 was identified as a crucial anti-metastasis regulator of colon carcinoma for the first time, and might provide novel therapeutic strategies for colon carcinoma patients to improve prognosis and prevent metastasis.
With recent advances in genotyping and sequencing technologies, many disease susceptibility loci have been identified. However, much of the genetic heritability remains unexplained and the replication rate between independent studies is still low. Meanwhile, there have been increasing efforts on functional annotations of the entire human genome, such as the Encyclopedia of DNA Elements (ENCODE) project and other similar projects. It has been shown that incorporating these functional annotations to prioritize genome-wide association signals may help identify true association signals. However, to our knowledge, the extent of the improvement when functional annotation data are considered has not been studied in the literature. In this article, we propose a statistical framework to estimate the improvement in replication rate with annotation data, and apply it to Crohn's disease and DNase I hypersensitive sites. The results show that with cell line-specific functional annotations, the expected replication rate is improved, but only at a modest level.
The kuruma prawn, Marsupenaeus japonicus, is one of the most cultivated and consumed species of shrimp. However, very few molecular genetic/genomic resources are publicly available for it. Thus, the characterization and distribution of simple sequence repeats (SSRs) remain ambiguous, and the use of SSR markers in genomic studies and marker-assisted selection is limited. The goal of this study is to characterize and develop genome-wide SSR markers in M. japonicus by genome survey sequencing for application in comparative genomics and breeding. A total of 326 945 perfect SSRs were identified, among which dinucleotide repeats were the most frequent class (44.08%), followed by mononucleotides (29.67%), trinucleotides (18.96%), tetranucleotides (5.66%), hexanucleotides (1.07%), and pentanucleotides (0.56%). In total, primers were successfully designed for 151 541 SSR loci. A subset of 30 SSR primer pairs was synthesized and tested in 42 individuals from a wild population, of which 27 loci (90.0%) were successfully amplified with specific products and 24 (80.0%) were polymorphic. For the amplified polymorphic loci, the number of alleles ranged from 5 to 17 (with an average of 9.63), and the average PIC value was 0.796. A total of 58 256 SSR-containing sequences had significant Gene Ontology annotation; these are good functional molecular marker candidates for association studies and comparative genomic analysis. The newly identified SSRs contribute significantly to the M. japonicus genomic resources and will facilitate a number of genetic and genomic studies, including high-density linkage mapping, genome-wide association analysis, marker-aided selection, comparative genomics analysis, population genetics, and evolution.
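The PIC (polymorphism information content) value reported for each locus is conventionally computed from allele frequencies p_i as PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2. The sketch below assumes that standard definition applies here; the function name `pic` and the allele counts in the usage lines are hypothetical.

```python
from itertools import combinations

def pic(allele_counts):
    """Polymorphism information content from allele counts or frequencies."""
    total = sum(allele_counts)
    if total <= 0:
        raise ValueError("allele counts must sum to a positive number")
    p = [c / total for c in allele_counts]          # normalize to frequencies
    homozygosity = sum(x * x for x in p)            # sum of p_i^2
    cross = sum(2 * a * a * b * b for a, b in combinations(p, 2))
    return 1 - homozygosity - cross

# Hypothetical loci: two equally frequent alleles, then four.
print(pic([21, 21]))          # 0.375
print(pic([1, 1, 1, 1]))      # 0.703125
```

PIC rises with both the number of alleles and the evenness of their frequencies, which is why loci with 5 to 17 alleles can average close to 0.8.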
Fabaceae is the third largest family of flowering plants and is unique among crops in its ability to fix atmospheric nitrogen. Fabaceae is one of the few plant families with extensive genomic data available in multiple species. The unprecedented complexity and impending completeness of these data create opportunities for discovering new approaches. Legumes and Medicago share highly conserved colinearity between their genomes, which can be exploited for genomic research in Leguminosae crops. In this study, 1,952,191 ESTs of 8 Leguminosae species were clustered into unigene contigs and compared with Medicago truncatula gene indices. Almost all the unigenes of Leguminosae species showed high similarity with Medicago genes, except for those of Lens culinaris, where 95% of unigenes were found to be similar. A total of 10,874 SSRs were identified in the unigenes. Functional annotation of unigenes showed that the majority of the genes fall into the metabolism and energy functional classes. It is expected that comparative genomic analysis between Medicago and related crop species will expedite research in other legume species. This would be helpful for genomics as well as evolutionary studies, and the DNA markers developed can be used for mapping, tagging and cloning of specific important genes in Leguminosae.
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we propose a new method, TripletGO, to deduce GO terms of protein-coding and noncoding genes through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Clustering is perhaps one of the most widely used tools for microarray data analysis. Proposed roles for genes of unknown function are inferred from clusters of genes similarly expressed across many biological conditions. However, whether function annotation by similarity metrics is reliable, and to what extent similarity in gene expression patterns is useful for annotating gene functions, has not been evaluated. This paper presents a comprehensive study of the correlation between the similarity of expression data and the similarity of gene functions using Gene Ontology. We found that although the similarity in expression patterns and the similarity in gene functions are significantly dependent on each other, this association is rather weak. In addition, among the three categories of Gene Ontology, the similarity of expression data is more useful for cellular component annotation than for biological process and molecular function. The results are relevant to research on gene function prediction.
Lonicera japonica Thunb., a traditional Chinese herb, has been used for treating human diseases for thousands of years. Recently, the genome of L. japonica has been decoded, providing valuable information for research into gene function. However, no comprehensive database for gene functional analysis and mining is available for L. japonica. We therefore constructed LjaFGD (www.gzybioinformatics.cn/LjaFGD and bioinformatics.cau.edu.cn/LjaFGD), a database for analyzing and comparing gene function in L. japonica. We constructed a gene co-expression network based on 77 RNA-seq samples, and then annotated genes of L. japonica by alignment against protein sequences from public databases. We also introduced several tools for gene functional analysis, including Blast, motif analysis, gene set enrichment analysis, heatmap analysis, and JBrowse. Our co-expression network revealed that MYB and WRKY transcription factor family genes were co-expressed with genes encoding key enzymes in the biosynthesis of chlorogenic acid and luteolin in L. japonica. We used flavonol synthase 1 (LjFLS1) as an example to show the reliability and applicability of our database. LjaFGD and its various associated tools will provide researchers with an accessible platform for retrieving functional information on L. japonica genes to further biological discovery.
Soil metaproteomics has excellent potential as a tool to elucidate the structural and functional changes in soil microbial communities in response to environmental alterations. However, soil metaproteomics is hindered by several challenges and gaps. Soil microbial communities have an extremely complex microbial composition, including many uncultured microorganisms without whole-genome sequences. Thus, selecting a suitable protein sequence database remains challenging in soil metaproteomics. In this study, the Public database and the Meta-database were constructed using protein sequences from public databases and metagenomics, respectively. We comprehensively analyzed and compared the soil metaproteomic results using these two kinds of protein sequence databases for protein identification based on published soil metaproteomic raw data. The results demonstrated that many more proteins, higher sequence coverage, and even more microbial species and functional annotations could be identified using the Meta-database than using the Public database. These findings indicated that the Meta-database was more specific as a protein sequence database. However, the follow-up in-depth metaproteomic analyses exhibited similar main results regardless of the database used. The microbial community composition at the genus level was similar between the two databases, especially for species annotations with high peptide-spectrum match counts and high abundance. The functional analyses in response to stress, such as the gene ontology enrichment of biological process and molecular function and the key functional microorganisms, were also similar regardless of the database. Our analysis revealed that the Public database could also meet the demand to explore the functional responses of microbial proteins to some extent. This study provides valuable insights into the choice of protein sequence databases and their impacts on subsequent bioinformatic analysis in soil metaproteomic research and will facilitate the optimization of experimental design for different purposes.
As the most pervasive epigenetic marker present on mRNAs and long non-coding RNAs (lncRNAs), N6-methyladenosine (m6A) RNA methylation has been shown to participate in essential biological processes. Recent studies have revealed distinct patterns of the m6A methylome across human tissues, and a major challenge remains in elucidating the tissue-specific presence and circuitry of m6A methylation. We present here a comprehensive online platform, m6A-TSHub, for unveiling context-specific m6A methylation and the genetic mutations that potentially regulate the m6A epigenetic mark. m6A-TSHub consists of four core components: (1) m6A-TSDB, a comprehensive database of 184,554 functionally annotated m6A sites derived from 23 human tissues and 499,369 m6A sites from 25 tumor conditions; (2) m6A-TSFinder, a web server for high-accuracy prediction of m6A methylation sites within a specific tissue from RNA sequences, which was constructed using multi-instance deep neural networks with gated attention; (3) m6A-TSVar, a web server for assessing the impact of genetic variants on tissue-specific m6A RNA modifications; and (4) m6A-CAVar, a database of 587,983 The Cancer Genome Atlas (TCGA) cancer mutations (derived from 27 cancer types) that were predicted to affect m6A modifications in the primary tissue of cancers. The database should be a useful resource for studying the m6A methylome and the genetic factors of epitranscriptome disturbance in a specific tissue (or cancer type). m6A-TSHub is accessible at www.xjtlu.edu.cn/biologicalsciences/m6ats.
Mammals have evolved mechanisms to sense hypoxia and induce hypoxic responses. Recently, high-throughput techniques have greatly promoted global studies of protein expression changes during hypoxia and the identification of candidate genes associated with hypoxia-adaptive evolution, which have contributed to the understanding of the complex regulatory networks of hypoxia. In this study, we developed an integrated resource for the expression dynamics of proteins in response to hypoxia (iHypoxia). This database contains 2589 expression events of 1944 proteins identified by low-throughput experiments (LTEs) and 422,553 quantitative expression events of 33,559 proteins identified by high-throughput experiments from five mammals that exhibit a response to hypoxia. Various experimental details, such as the hypoxic experimental conditions, expression patterns, and sample types, were carefully collected and integrated. Furthermore, 8788 candidate genes from diverse species inhabiting low-oxygen environments were also integrated. In addition, we conducted an orthologous search and computationally identified 394,141 proteins that may respond to hypoxia among 48 animals. An enrichment analysis of human proteins identified from LTEs shows that these proteins are enriched in certain drug targets and cancer genes. Annotation of known posttranslational modification (PTM) sites in the proteins identified by LTEs reveals that these proteins undergo extensive PTMs, particularly phosphorylation, ubiquitination, and acetylation. iHypoxia provides a convenient and user-friendly way for users to obtain hypoxia-related information of interest.
Genetic and epigenetic changes after polyploidization events could result in variable gene expression and modified regulatory networks. Here, using large-scale transcriptome data, we constructed co-expression networks for diploid, tetraploid, and hexaploid wheat species, and built a platform for comparing co-expression networks of allohexaploid wheat and its progenitors, named WheatCENet. WheatCENet is a platform for searching and comparing specific functional co-expression networks, as well as identifying the related functions of the genes clustered therein. Functional annotations such as pathways, gene families, protein-protein interactions, microRNAs (miRNAs), and several lines of epigenome data are integrated into this platform, and Gene Ontology (GO) annotation, gene set enrichment analysis (GSEA), motif identification, and other useful tools are also included. Using WheatCENet, we found that the network of WHEAT ABERRANT PANICLE ORGANIZATION 1 (WAPO1) has more co-expressed genes related to spike development in hexaploid wheat than in its progenitors. We also found a novel CCWWWWWWGG (CArG) motif specifically in the promoter region of WAPO-A1, suggesting that neofunctionalization of the WAPO-A1 gene affects spikelet development in hexaploid wheat. WheatCENet is useful for investigating co-expression networks and conducting other analyses, and thus facilitates comparative and functional genomic studies in wheat.
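Searching a promoter sequence for the CCWWWWWWGG (CArG) motif amounts to expanding the IUPAC code W (A or T) into a regular expression. The sketch below is a generic illustration, not the motif tool used in WheatCENet; the function name `motif_positions` and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def motif_positions(seq, motif="CCWWWWWWGG"):
    """Return 0-based start positions of every (possibly overlapping) match."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]

# Made-up promoter fragment containing one CArG box at position 2.
print(motif_positions("ttCCATATATGGaaCCAAGTTTGG"))  # [2]
```

Because CCWWWWWWGG is its own reverse complement as a pattern, scanning one strand is enough for this particular motif; an asymmetric motif would also need a reverse-complement scan.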
The Malaysian mahseer (Tor tambroides), one of the most valuable freshwater fish in the world, is mainly targeted for human consumption. Mitogenomic data for this species are available, but genomic information is still lacking. For the first time, we sequenced the whole genome of an adult fish on both Illumina and Nanopore platforms. The hybrid genome assembly yielded 1.23 Gb of genomic sequence in 44,726 contigs, with an N50 length of 44 kb and a BUSCO genome completeness of 87.6%. Four types of SSRs were detected and identified within the genome, with a greater abundance of AT than of GC. Predicted protein sequences were functionally annotated against public databases, namely GO, KEGG and COG. A maximum likelihood phylogenomic tree containing 52 Actinopterygii species and one Sarcopterygii species as outgroup was constructed, providing first insights into the genome-based evolutionary relationship of T. tambroides with other ray-finned fish. These data are crucial in facilitating the study of population genomics, species identification, morphological variations, and evolutionary biology, which are helpful in the conservation of this species.
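The N50 statistic quoted for the assembly is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of that standard definition follows; the function name `n50` and the contig lengths in the usage line are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy contig lengths: total is 54, and 10 + 9 + 8 = 27 reaches half.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A low N50 relative to total assembly size, as with 44 kb against 1.23 Gb, indicates a fragmented assembly made of many short contigs.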
Eukaryotic mRNAs consist of two forms of transcripts, poly(A)+ and poly(A)-, based on the presence or absence of poly(A) tails at the 3′ end. Poly(A)+ mRNAs are mainly protein-coding mRNAs, whereas the functions of poly(A)- mRNAs are largely unknown. Previous studies have shown that a significant proportion of gene transcripts are poly(A)- or bimorphic (containing both poly(A)+ and poly(A)- transcripts). We compared the expression levels of poly(A)- and poly(A)+ mRNAs in normal and cancer cell lines. We also investigated the potential functions of these RNA transcripts using an integrative workflow to explore poly(A)+ and poly(A)- transcriptome sequences between a normal human mammary gland cell line (HMEC) and a breast cancer cell line (MCF-7), as well as between a normal human lung cell line (NHLF) and a lung cancer cell line (A549). The data showed that normal and cancer cell lines differentially express these two forms of mRNA. Gene ontology (GO) annotation analyses hinted at the functions of these two groups of transcripts and grouped the differentially expressed genes according to the form of their transcript. The data showed that cell cycle-, apoptosis-, and cell death-related functions corresponded to most of the differentially expressed genes in these two forms of transcripts, which were also associated with the cancers. Furthermore, translational elongation and translation functions were also found for the poly(A)- protein-coding genes in cancer cell lines. We demonstrate that poly(A)- transcripts play an important role in cancer development.
The functional impact of several long intergenic non-coding RNAs (lincRNAs) has been characterized in previous studies. However, it is difficult to identify lincRNAs on a large scale and to ascertain their functions or predict their structures in laboratory experiments because of the diversity, lack of knowledge and specificity of expression of lincRNAs. Furthermore, although there are a few well-characterized examples of lincRNAs associated with cancers, these are just the tip of the iceberg owing to the complexity of cancer. Here, by combining RNA-Seq data from several kinds of human cell lines with chromatin-state maps and human expressed sequence tags, we successfully identified more than 3000 human lincRNAs, most of which were new. Subsequently, we predicted the functions of 105 lincRNAs based on a coding-non-coding gene co-expression network. Finally, we propose a genetic mediator and key regulator model to unveil the subtle relationships between lincRNAs and lung cancer. Twelve lincRNAs may be principal players in lung tumorigenesis. The present study combines large-scale identification and functional prediction of human lincRNAs, and is a pioneering work in characterizing cancer-associated lincRNAs by bioinformatics.
Funding: Supported by the Key International (Regional) Cooperative Research Project (No. 81820108028); the National Natural Science Foundation of China (Nos. 81521004, 81922061, 81973123, and 81803306); the Science Foundation for Distinguished Young Scholars of Jiangsu (No. BK20160046); the Priority Academic Program for the Development of Jiangsu Higher Education Institutions (Public Health and Preventive Medicine); and the National Cancer Institute, National Institutes of Health of USA, through grants U01-CA063673, UM1-CA167462, and U01-CA167462.
Abstract: Although genome-wide association studies have identified more than eighty genetic variants associated with non-small cell lung cancer (NSCLC) risk, the biological mechanisms of these variants remain largely unknown. By integrating large-scale genotype data of 15581 lung adenocarcinoma (AD) cases, 8350 squamous cell carcinoma (SqCC) cases, and 27355 controls, as well as multiple transcriptome and epigenomic databases, we conducted histology-specific meta-analyses and functional annotations of both reported and novel susceptibility variants. We identified 3064 credible risk variants for NSCLC, which were overrepresented in enhancer-like and promoter-like histone modification peaks as well as DNase I hypersensitive sites. Transcription factor enrichment analysis revealed that USF1 was AD-specific while CREB1 was SqCC-specific. Functional annotation and gene-based analysis implicated 894 target genes, including 274 specific to AD and 123 specific to SqCC, which were overrepresented in somatic driver genes (ER=1.95, P=0.005). Pathway enrichment analysis and Gene-Set Enrichment Analysis revealed that AD genes were primarily involved in immune-related pathways, while SqCC genes were related to homologous recombination deficiency. Our results illustrate the molecular basis of both well-studied and new susceptibility loci of NSCLC, providing not only novel insights into the genetic heterogeneity between AD and SqCC but also a set of plausible gene targets for post-GWAS functional experiments.
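The abstract above reports an enrichment ratio and a P value for the overlap between target genes and somatic driver genes, but it does not say which test was used. As a minimal sketch, assuming a one-sided hypergeometric test (an assumption, not the authors' stated method), the two quantities could be computed as follows. The function name `enrichment` and its argument names are hypothetical.

```python
from math import comb

def enrichment(k, n, K, N):
    """Enrichment ratio and one-sided hypergeometric P value.

    k: target genes that are also driver genes
    n: number of target genes
    K: number of driver genes in the background
    N: number of background genes
    """
    # Observed fraction k/n divided by expected fraction K/N.
    ratio = (k * N) / (n * K)
    # Probability of seeing k or more overlaps when drawing n genes at random.
    upper_tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return ratio, upper_tail / comb(N, n)
```

The counts passed in would come from the gene lists themselves; none of the numbers in the abstract are reproduced here.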
Funding: Supported in part by the National Natural Science Foundation of China (22033001), the National Key R&D Program of China (2022YFA1303700), and the Chinese Academy of Medical Sciences (2021-I2M-5-014).
Abstract: Proteins are integral actors in essential life processes, which makes protein research a fundamental domain with the potential to propel advances in pharmaceuticals and disease investigation. Within protein research, there is a pressing need to uncover protein functions and untangle their intricate mechanistic underpinnings. Because experimental investigations are costly and limited in throughput, computational models offer a promising alternative for accelerating protein function annotation. In recent years, protein pre-training models have shown noteworthy advances across multiple prediction tasks. These advances highlight a notable prospect for effectively tackling the intricate downstream task of protein function prediction. In this review, we describe the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Funding: This study was supported by a grant from the National Natural Science Foundation of China (No. 30540033). There are no conflicts of interest.
Abstract: Background: Brain hypoplasia and mental retardation in Down syndrome (DS) can be attributed to a severe and selective disruption of neurogenesis. Secondary disruption of the transcriptome, as well as primary gene dosage imbalance, is responsible for the phenotype. MicroRNA (miRNA) expression is relatively abundant in brain tissue. Perturbed miRNA expression might contribute to the cellular events underlying the pathology in DS. Methods: MiRNA expression profiles in the cerebrum of Ts65Dn mice, a DS model, were examined with a real-time RT-PCR array. MiRNA target gene expression was detected by real-time quantitative PCR and Western blotting. Based on the prediction of their cerebrum-specific targets, the functions of the misregulated miRNAs were annotated by Gene Ontology (GO) enrichment analysis. Results: A total of 342 miRNAs were examined. Among them, 20 miRNAs showed decreased expression in the brains of Ts65Dn mice, and some of these belonged to the same family. Two known targets of the miR-200 family, Lfng and Zeb2, were specifically selected to compare their expression in the cerebrum of Ts65Dn mice with that in euploids. However, no significant difference was found in the mRNA and protein expression levels of these genes. By enrichment analysis of the cerebrum-specific targets of each miRNA, we found that 15 of the differential miRNAs could significantly affect target genes that were enriched in the GO biological processes related to nervous system development. Conclusion: Perturbed expression of multiple functionally cooperative miRNAs contributes to the cellular events underlying the pathogenesis of DS.
Funding: Financial assistance from ICARDA, Morocco, in the form of a brief project, and grant support from the Northern Pulse Growers Association and the USA Dry Pea and Lentil Council are gratefully acknowledged.
Abstract: Lentil (Lens culinaris Medik.), a diploid (2n = 14) with a genome size greater than 4000 Mbp, is an important cool season food legume grown worldwide. The availability of genomic resources is limited in this crop species. The objective of this study was to develop polymorphic markers in lentil using publicly available curated expressed sequence tags (ESTs). In this study, 9513 ESTs were downloaded from the National Center for Biotechnology Information (NCBI) database to develop unigene-based simple sequence repeat (SSR) markers. The ESTs were assembled into 4053 unigenes and then analyzed to identify 374 SSRs using the MISA microsatellite identification tool. Among the 374 SSRs, 26 compound SSRs were observed. Primer pairs for these SSRs were designed using Primer3 version 1.14. To classify the functional annotation of ESTs and EST-SSRs, BLASTx searches (using an E-value of 1 × 10⁻⁵) against the public UniProt (http://www.uniprot.org/) and NCBI (http://www.ncbi.nlm.nih.gov/) databases were performed. Further functional annotation was performed using PLAZA (version 3.0) comparative genomics, and GO annotation was summarized using the Plant GO slim category. Among the 312 primers synthesized, 219 successfully amplified Lens DNA. A diverse panel of 24 Lens genotypes was used to identify polymorphic markers. A polymorphic set of 57 markers successfully discriminated the test genotypes. This set of polymorphic markers with functional annotation data could be used as molecular tools in lentil breeding.
Funding: Supported by the Self Regional Healthcare Foundation, USA.
Abstract: Life may have begun in an RNA world, a view supported by increasing evidence of the vital role that RNAs perform in biological systems. In the human genome, most genes do not encode proteins; they are noncoding RNA genes. The largest class of noncoding genes is known as long noncoding RNAs (lncRNAs), which are transcripts longer than 200 nucleotides with no protein-coding capacity. While some lncRNAs have been demonstrated to be key regulators of gene expression and 3D genome organization, most lncRNAs are still uncharacterized. We thus propose several data mining and machine learning approaches for the functional annotation of human lncRNAs by leveraging the vast amount of data from genetic and genomic studies. Recent results from our studies and those of other groups indicate that genomic data mining can give insights into lncRNA functions and provide valuable information for experimental studies of candidate lncRNAs associated with human disease.
Funding: Supported by the National Natural Science Foundation of China (31000591, 31000587, 31171266).
Abstract: The discovery of novel cancer genes is one of the main goals in cancer research. Bioinformatics methods can be used to accelerate cancer gene discovery, which may help in the understanding of cancer and the development of drug targets. In this paper, we describe a classifier that we developed to predict potential cancer genes by integrating multiple lines of biological evidence, including protein-protein interaction network properties and sequence and functional features. We detected 55 features that were significantly different between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods, logistic regression, support vector machines (SVMs), BayesNet and decision tree, were explored in the classifier models to distinguish cancer genes from non-cancer genes. The prediction power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for the logistic regression, SVM, BayesNet and J48 tree models was 0.834, 0.740, 0.800 and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than the models based on individual lines of biological evidence, and that the network and functional features had stronger predictive power than the sequence features.
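The evaluation described above rests on two generic ingredients: splitting genes into five folds and scoring each model by the area under the ROC curve. The sketch below shows both in plain Python. It is not the authors' pipeline, and the helper names `roc_auc` and `k_fold_indices` are hypothetical. The AUC is computed as the fraction of (cancer gene, non-cancer gene) pairs in which the cancer gene receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k interleaved folds over n items."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

In use, a classifier would be fitted on each `train` split, its scores on the matching `test` split collected, and `roc_auc` applied to the pooled or per-fold scores.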
Funding: Supported by grants from the National Natural Science Foundation of China (No. 8177061284).
Abstract: Objective: Liver metastasis, which contributes substantially to high mortality, is the most common mode of recurrence in colon carcinoma. Thus, it is necessary to identify genes implicated in metastatic colonization of the liver in colon carcinoma. Methods: We compared mRNA profiles in 18 normal colon mucosa (N), 20 primary tumor (T) and 19 liver metastasis (M) samples from the GSE49355 and GSE62321 datasets of the Gene Expression Omnibus (GEO) database. Gene ontology (GO) terms and pathways of the identified genes were analyzed. A co-expression network and a protein-protein interaction (PPI) network were employed to identify interaction relationships. Survival analyses based on The Cancer Genome Atlas (TCGA) database were used for further screening. Then, the candidate genes were validated with our own data. Results: We identified 22 specific genes related to liver metastasis, and they were strongly associated with cell migration, adhesion, proliferation and immune response. The results also showed that C-X-C motif chemokine ligand 14 (CXCL14) might be a favorable predictive factor for survival of patients with colon carcinoma. Importantly, our validation data further suggested that lower CXCL14 expression was associated with poorer outcome and contributed to metastasis. Gene set enrichment analysis (GSEA) showed that CXCL14 was negatively related to the regulation of stem cell proliferation and epithelial to mesenchymal transition (EMT). Conclusions: CXCL14 was identified as a crucial anti-metastasis regulator of colon carcinoma for the first time, and might provide novel therapeutic strategies for colon carcinoma patients to improve prognosis and prevent metastasis.
Funding: Supported in part by the National Institutes of Health (R01 GM59507 and U01 HG005718) and the VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development.
Abstract: With recent advances in genotyping and sequencing technologies, many disease susceptibility loci have been identified. However, much of the genetic heritability remains unexplained and the replication rate between independent studies is still low. Meanwhile, there have been increasing efforts on functional annotation of the entire human genome, such as the Encyclopedia of DNA Elements (ENCODE) project and other similar projects. It has been shown that incorporating these functional annotations to prioritize genome-wide association signals may help identify true association signals. However, to our knowledge, the extent of the improvement when functional annotation data are considered has not been studied in the literature. In this article, we propose a statistical framework to estimate the improvement in replication rate with annotation data, and apply it to Crohn's disease and DNase I hypersensitive sites. The results show that with cell line-specific functional annotations, the expected replication rate is improved, but only to a modest degree.
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (No. 2012AA10A409).
Abstract: The kuruma prawn, Marsupenaeus japonicus, is one of the most cultivated and consumed species of shrimp. However, very few molecular genetic/genomic resources are publicly available for it. Thus, the characterization and distribution of simple sequence repeats (SSRs) remain unclear, and the use of SSR markers in genomic studies and marker-assisted selection is limited. The goal of this study is to characterize and develop genome-wide SSR markers in M. japonicus by genome survey sequencing for application in comparative genomics and breeding. A total of 326 945 perfect SSRs were identified, among which dinucleotide repeats were the most frequent class (44.08%), followed by mononucleotides (29.67%), trinucleotides (18.96%), tetranucleotides (5.66%), hexanucleotides (1.07%), and pentanucleotides (0.56%). In total, primers were successfully designed for 151 541 SSR loci. A subset of 30 SSR primer pairs was synthesized and tested in 42 individuals from a wild population, of which 27 loci (90.0%) were successfully amplified with specific products and 24 (80.0%) were polymorphic. For the amplified polymorphic loci, the number of alleles ranged from 5 to 17 (with an average of 9.63), and the average PIC value was 0.796. A total of 58 256 SSR-containing sequences had significant Gene Ontology annotation; these are good functional molecular marker candidates for association studies and comparative genomic analysis. The newly identified SSRs significantly contribute to the M. japonicus genomic resources and will facilitate a number of genetic and genomic studies, including high density linkage mapping, genome-wide association analysis, marker-aided selection, comparative genomics analysis, population genetics, and evolution.
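The abstract above counts perfect SSRs by motif length (mono- to hexanucleotide) but does not name the detection tool or its thresholds. The sketch below illustrates the general idea with a regular-expression scan: a motif of length k followed by enough exact copies of itself. The minimum repeat counts in `MIN_REPEATS` are hypothetical defaults chosen for illustration, and `find_perfect_ssrs` and `class_frequencies` are hypothetical helper names, not part of any published tool.

```python
import re
from collections import Counter

# Minimum number of repeats per motif length (hypothetical thresholds).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-base motif followed by at least n - 1 further exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit ("AA", "ATAT").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

def class_frequencies(hits):
    """Fraction of SSRs in each motif-length class."""
    counts = Counter(len(motif) for _, motif, _ in hits)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}
```

Applied to a set of assembled sequences, `class_frequencies` would yield a breakdown of the same kind as the percentages quoted in the abstract, though the actual values depend on the thresholds chosen.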
Abstract: Fabaceae is the third largest family of flowering plants and is unique among crops in its ability to fix atmospheric nitrogen. Fabaceae is one of the few plant families with extensive genomic data available in multiple species. The unprecedented complexity and impending completeness of these data create opportunities for discovering new approaches. Legumes and Medicago share highly conserved colinearity between their genomes, which can be exploited for genomic research in Leguminosae crops. In this study, 1,952,191 ESTs of 8 Leguminosae species were clustered into unigene contigs and compared with Medicago truncatula gene indices. Almost all the unigenes of Leguminosae species showed high similarity with Medicago genes, except for those of Lens culinaris, where 95% of unigenes were found to be similar. A total of 10,874 SSRs were identified in the unigenes. Functional annotation of unigenes showed that the majority of the genes fall into the metabolism and energy functional classes. It is expected that comparative genomic analysis between Medicago and related crop species will expedite research in other legume species. This would be helpful for genomics as well as evolutionary studies, and the DNA markers developed can be used for mapping, tagging and cloning of specific important genes in Leguminosae.
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62072243 and 61772273 to Dong-Jun Yu); the Natural Science Foundation of Jiangsu, China (Grant No. BK20201304 to Dong-Jun Yu); the Foundation of National Defense Key Laboratory of Science and Technology, China (Grant No. JZX7Y202001SY000901 to Dong-Jun Yu); the China Scholarship Council (Grant No. 201906840041 to Yi-Heng Zhu); the National Institute of Environmental Health Sciences, USA (Grant No. P30ES017885 to Gilbert S. Omenn); the National Cancer Institute, USA (Grant No. U24CA210967 to Gilbert S. Omenn); the National Institute of General Medical Sciences, USA (Grant Nos. GM136422 and S10OD026825 to Yang Zhang); the National Institute of Allergy and Infectious Diseases, USA (Grant No. AI134678 to Peter L. Freddolino and Yang Zhang); and the National Science Foundation, USA (Grant Nos. IIS1901191, DBI2030790, and MTM2025426 to Yang Zhang). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation, USA (Grant No. ACI1548562).
Abstract: Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and noncoding genes through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naive probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematode) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Funding: Project supported by the Key Program of Basic Research of Science & Technology Commission of Shanghai Municipality (No. 04dz14004) and the Shanghai Natural Science Foundation (No. 03ZR14065). Dedicated to Professor Xikui Jiang on the occasion of his 80th birthday.
Abstract: Clustering is perhaps one of the most widely used tools for microarray data analysis. Proposed roles for genes of unknown function are inferred from clusters of genes similarly expressed across many biological conditions. However, whether function annotation by similarity metrics is reliable, and to what extent similarity in gene expression patterns is useful for annotating gene functions, has not been evaluated. This paper comprehensively examines the correlation between the similarity of expression data and the similarity of gene functions using Gene Ontology. It was found that although the similarity in expression patterns and the similarity in gene functions are significantly dependent on each other, this association is rather weak. In addition, among the three categories of Gene Ontology, the similarity of expression data is more useful for cellular component annotation than for biological process and molecular function. The results presented are of interest for research on gene function prediction.
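The comparison described above needs two numbers for every gene pair: how similar their expression profiles are and how similar their annotations are. The abstract does not specify either metric, so the sketch below makes two labeled assumptions: expression similarity is the Pearson correlation of the two profiles, and functional similarity is the Jaccard index of the genes' GO term sets. Both helper names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / sqrt(sxx * syy)

def jaccard(terms_a, terms_b):
    """Shared GO terms divided by all GO terms of the two genes."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Computing both values over many gene pairs and then correlating the two resulting lists (again with `pearson`) gives one simple measure of how strongly expression similarity tracks functional similarity.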
Funding: This work was supported by the Ph.D. Startup Foundation of Guizhou University of Traditional Chinese Medicine (no. (2020)32 and no. (2019)141) and the National Natural Science Foundation of China (no. 31970629).
Abstract: Lonicera japonica Thunb., a traditional Chinese herb, has been used for treating human diseases for thousands of years. Recently, the genome of L. japonica has been decoded, providing valuable information for research into gene function. However, no comprehensive database for gene functional analysis and mining is available for L. japonica. We therefore constructed LjaFGD (www.gzybioinformatics.cn/LjaFGD and bioinformatics.cau.edu.cn/LjaFGD), a database for analyzing and comparing gene function in L. japonica. We constructed a gene co-expression network based on 77 RNA-seq samples, and then annotated genes of L. japonica by alignment against protein sequences from public databases. We also introduced several tools for gene functional analysis, including Blast, motif analysis, gene set enrichment analysis, heatmap analysis, and JBrowse. Our co-expression network revealed that MYB and WRKY transcription factor family genes were co-expressed with genes encoding key enzymes in the biosynthesis of chlorogenic acid and luteolin in L. japonica. We used flavonol synthase 1 (LjFLS1) as an example to show the reliability and applicability of our database. LjaFGD and its various associated tools will provide researchers with an accessible platform for retrieving functional information on L. japonica genes to further biological discovery.
Funding: Supported by the National Key Research and Development Program of China (No. 2016YFD0200-308), the National Key Basic Research Program of China (No. 2015CB150501), and the Project of Priority and Key Areas, Institute of Soil Science, Chinese Academy of Sciences (Nos. ISSASIP1605 and ISSASIP1640).
Abstract: Soil metaproteomics has excellent potential as a tool to elucidate the structural and functional changes in soil microbial communities in response to environmental alterations. However, soil metaproteomics is hindered by several challenges and gaps. Soil microbial communities have an extremely complex composition, including many uncultured microorganisms without whole genome sequences. Thus, selecting a suitable protein sequence database remains challenging in soil metaproteomics. In this study, the Public database and the Meta-database were constructed using protein sequences from public databases and metagenomics, respectively. We comprehensively analyzed and compared the soil metaproteomic results obtained with these two kinds of protein sequence databases for protein identification, based on published soil metaproteomic raw data. The results demonstrated that many more proteins, higher sequence coverage, and more microbial species and functional annotations could be identified using the Meta-database than using the Public database. These findings indicated that the Meta-database was more specific as a protein sequence database. However, the follow-up in-depth metaproteomic analyses gave similar main results regardless of the database used. The microbial community composition at the genus level was similar between the two databases, especially for the species annotations with high peptide-spectrum match counts and high abundance. The functional analyses in response to stress, such as the gene ontology enrichment of biological process and molecular function and the key functional microorganisms, were also similar regardless of the database. Our analysis revealed that the Public database could also meet the demand to explore the functional responses of microbial proteins to some extent. This study provides valuable insights into the choice of protein sequence databases and their impacts on subsequent bioinformatic analysis in soil metaproteomic research and will facilitate the optimization of experimental design for different purposes.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 32100519 and 31671373), the Scientific Research Foundation for Advanced Talents of Fujian Medical University (Grant No. XRCZX2021019), and the XJTLU Key Program Special Fund (Grant Nos. KSF-T-01, KSF-E-51, and KSF-P-02), China.
Abstract: As the most pervasive epigenetic marker present on mRNAs and long non-coding RNAs (lncRNAs), N6-methyladenosine (m6A) RNA methylation has been shown to participate in essential biological processes. Recent studies have revealed distinct patterns of the m6A methylome across human tissues, and a major challenge remains in elucidating the tissue-specific presence and circuitry of m6A methylation. We present here a comprehensive online platform, m6A-TSHub, for unveiling context-specific m6A methylation and genetic mutations that potentially regulate the m6A epigenetic mark. m6A-TSHub consists of four core components: (1) m6A-TSDB, a comprehensive database of 184,554 functionally annotated m6A sites derived from 23 human tissues and 499,369 m6A sites from 25 tumor conditions; (2) m6A-TSFinder, a web server for high-accuracy prediction of m6A methylation sites within a specific tissue from RNA sequences, which was constructed using multi-instance deep neural networks with gated attention; (3) m6A-TSVar, a web server for assessing the impact of genetic variants on tissue-specific m6A RNA modifications; and (4) m6A-CAVar, a database of 587,983 The Cancer Genome Atlas (TCGA) cancer mutations (derived from 27 cancer types) that were predicted to affect m6A modifications in the primary tissue of cancers. The database should be a useful resource for studying the m6A methylome and the genetic factors of epitranscriptome disturbance in a specific tissue (or cancer type). m6A-TSHub is accessible at www.xjtlu.edu.cn/biologicalsciences/m6ats.
Funding: Supported by grants from the National Key R&D Program of China (Grant No. 2021YFA1302100 to Ze-Xian Liu); the National Natural Science Foundation of China (Grant No. U2004152 to Zhenlong Wang, Grant Nos. 81972239 and 91953123 to Ze-Xian Liu); the Fostering Fund of Fundamental Research for Young Teachers of Zhengzhou University, China (Grant No. JC21343016 to Han Cheng); the Program for Guangdong Introducing Innovative and Entrepreneurial Teams, China (Grant No. 2017ZT07S096 to Ze-Xian Liu); and the Tip-Top Scientific and Technical Innovative Youth Talents of Guangdong Special Support Program, China (Grant No. 2019TQ05Y351 to Ze-Xian Liu).
Abstract: Mammals have evolved mechanisms to sense hypoxia and induce hypoxic responses. Recently, high-throughput techniques have greatly promoted global studies of protein expression changes during hypoxia and the identification of candidate genes associated with hypoxia-adaptive evolution, which have contributed to the understanding of the complex regulatory networks of hypoxia. In this study, we developed an integrated resource for the expression dynamics of proteins in response to hypoxia (iHypoxia). This database contains 2589 expression events of 1944 proteins identified by low-throughput experiments (LTEs) and 422,553 quantitative expression events of 33,559 proteins identified by high-throughput experiments from five mammals that exhibit a response to hypoxia. Various experimental details, such as the hypoxic experimental conditions, expression patterns, and sample types, were carefully collected and integrated. Furthermore, 8788 candidate genes from diverse species inhabiting low-oxygen environments were also integrated. In addition, we conducted an orthologous search and computationally identified 394,141 proteins that may respond to hypoxia among 48 animals. An enrichment analysis of human proteins identified from LTEs shows that these proteins are enriched in certain drug targets and cancer genes. Annotation of known posttranslational modification (PTM) sites in the proteins identified by LTEs reveals that these proteins undergo extensive PTMs, particularly phosphorylation, ubiquitination, and acetylation. iHypoxia provides a convenient and user-friendly method for users to obtain hypoxia-related information of interest.
Funding: Supported by grants from the National Natural Science Foundation of China (Grant Nos. 31970629 and 31771467 to ZS, and 31870209 to YJ).
Abstract: Genetic and epigenetic changes after polyploidization events could result in variable gene expression and modified regulatory networks. Here, using large-scale transcriptome data, we constructed co-expression networks for diploid, tetraploid, and hexaploid wheat species, and built a platform for comparing co-expression networks of allohexaploid wheat and its progenitors, named WheatCENet. WheatCENet is a platform for searching and comparing specific functional co-expression networks, as well as identifying the related functions of the genes clustered therein. Functional annotations such as pathways, gene families, protein-protein interactions, microRNAs (miRNAs), and several lines of epigenome data are integrated into this platform, and Gene Ontology (GO) annotation, gene set enrichment analysis (GSEA), motif identification, and other useful tools are also included. Using WheatCENet, we found that the network of WHEAT ABERRANT PANICLE ORGANIZATION 1 (WAPO1) has more co-expressed genes related to spike development in hexaploid wheat than in its progenitors. We also found a novel motif of CCWWWWWWGG (CArG) specifically in the promoter region of WAPO-A1, suggesting that neofunctionalization of the WAPO-A1 gene affects spikelet development in hexaploid wheat. WheatCENet is useful for investigating co-expression networks and conducting other analyses, and thus facilitates comparative and functional genomic studies in wheat.
Funding: This work was fully funded by the Sarawak Research and Development Council through the Research Initiation Grant Scheme, with grant number RDCRG/RIF/2019/13 awarded to H. H. Chung.
Abstract: The Malaysian mahseer (Tor tambroides), one of the most valuable freshwater fish in the world, is mainly targeted for human consumption. Mitogenomic data for this species are available to date, but genomic information is still lacking. For the first time, we sequenced the whole genome of an adult fish on both Illumina and Nanopore platforms. The hybrid genome assembly yielded a total of 1.23 Gb of genomic sequence across 44,726 contigs, with an N50 length of 44 kb and a BUSCO genome completeness of 87.6%. Four types of SSRs were detected and identified within the genome, with a greater abundance of AT than of GC. Predicted protein sequences were functionally annotated against public databases, namely GO, KEGG and COG. A maximum likelihood phylogenomic tree containing 52 Actinopterygii species and one Sarcopterygii species as an outgroup was constructed, providing first insights into the genome-based evolutionary relationship of T. tambroides with other ray-finned fish. These data are crucial in facilitating the study of population genomics, species identification, morphological variations, and evolutionary biology, which are helpful in the conservation of this species.
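The N50 statistic quoted above has a simple definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. A minimal sketch follows; the function name `n50` is illustrative and not tied to any particular assembler.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum of contigs,
    taken longest first, reaches at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

For contig lengths 2 through 10 (total 54), the three longest contigs sum to 27, so `n50` returns 8.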
Funding: Supported in part by the National Natural Science Foundation of China (31000564, 31071137, 91229120), the Beijing Natural Science Foundation (5122029), and the Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-R-01).
Abstract: Eukaryotic mRNAs consist of two forms of transcripts, poly(A)+ and poly(A)−, based on the presence or absence of poly(A) tails at the 3′ end. Poly(A)+ mRNAs are mainly protein-coding mRNAs, whereas the functions of poly(A)− mRNAs are largely unknown. Previous studies have shown that a significant proportion of gene transcripts are poly(A)− or bimorphic (containing both poly(A)+ and poly(A)− transcripts). We compared the expression levels of poly(A)− and poly(A)+ mRNAs in normal and cancer cell lines. We also investigated the potential functions of these RNA transcripts using an integrative workflow to explore poly(A)+ and poly(A)− transcriptome sequences between a normal human mammary gland cell line (HMEC) and a breast cancer cell line (MCF-7), as well as between a normal human lung cell line (NHLF) and a lung cancer cell line (A549). The data showed that normal and cancer cell lines differentially express these two forms of mRNA. Gene ontology (GO) annotation analyses hinted at the functions of these two groups of transcripts and grouped the differentially expressed genes according to the form of their transcript. The data showed that cell cycle-, apoptosis-, and cell death-related functions corresponded to most of the differentially expressed genes in these two forms of transcripts, which were also associated with the cancers. Furthermore, translational elongation and translation functions were also found for the poly(A)− protein-coding genes in cancer cell lines. We demonstrate that poly(A)− transcripts play an important role in cancer development.
Funding: Supported by the Beijing Natural Science Foundation (5122029).
Abstract: The functional impact of several long intergenic non-coding RNAs (lincRNAs) has been characterized in previous studies. However, it is difficult to identify lincRNAs on a large scale and to ascertain their functions or predict their structures in laboratory experiments because of the diversity, lack of knowledge and specificity of expression of lincRNAs. Furthermore, although there are a few well-characterized examples of lincRNAs associated with cancers, these are just the tip of the iceberg owing to the complexity of cancer. Here, by combining RNA-Seq data from several kinds of human cell lines with chromatin-state maps and human expressed sequence tags, we identified more than 3000 human lincRNAs, most of which were new. Subsequently, we predicted the functions of 105 lincRNAs based on a coding-non-coding gene co-expression network. Finally, we propose a genetic mediator and key regulator model to unveil the subtle relationships between lincRNAs and lung cancer. Twelve lincRNAs may be principal players in lung tumorigenesis. The present study combines large-scale identification and functional prediction of human lincRNAs, and is a pioneering work in characterizing cancer-associated lincRNAs by bioinformatics.