Abstract: In motif discovery, and especially in the discovery of transcription factor DNA binding sites, an overly long input sequence tends to return non-informative motifs rather than biologically functional ones. This paper presents theoretical analyses and computational experiments that suggest length limits for the input sequence. When the sequence length exceeds a certain critical point, the probability of discovering the motif decreases sharply. The work offers an explanation for unsatisfactory results in existing motif discovery problems, namely that the input sequence may be too long and exceed this point, and it provides an estimate of the input sequence length that should be accepted to obtain more meaningful and reliable results.
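As a rough illustration of why longer inputs can dilute the signal, the sketch below computes the expected number of chance occurrences of one fixed k-mer in a random sequence. It assumes independent, uniformly distributed bases, which is a textbook simplification and not the analysis used in the paper; the function name and the example lengths are hypothetical.

```python
def expected_chance_hits(seq_len: int, k: int, alphabet_size: int = 4) -> float:
    """Expected occurrences of one fixed k-mer in an i.i.d. uniform random sequence.

    Assumption (not from the paper): every base is drawn independently and
    uniformly, so each window of width k matches with probability alphabet_size**-k.
    """
    if k <= 0 or seq_len < k:
        return 0.0
    positions = seq_len - k + 1  # number of windows of width k
    return positions / alphabet_size ** k


if __name__ == "__main__":
    # Hypothetical input lengths, chosen only to show the linear growth.
    for length in (100, 1_000, 10_000, 100_000):
        print(length, expected_chance_hits(length, k=8))
```

Under these assumptions the expected count grows linearly with the input length, so a sufficiently long input is likely to contain spurious matches to any given short motif.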
Abstract: MicroRNAs (miRNAs) are short (~21 nt) nucleotide sequences that are either co-transcribed during the production of mRNA or organized in intergenic regions transcribed by RNA polymerase II. Drosha in animals and DCL1 in plants recognize pre-miRNAs, which are set apart by their characteristic stem-loop (hairpin) structure. This structure appears to be important for their recognition during maturation into functional mature miRNAs. A large body of research is available on computational pre-miRNA detection in animals, but less on detection in plants. Pre-miRNAs are usually predicted with machine learning approaches, so they must first be converted into a set of computable features, and many such features have been described. Here we select a subset of the previously described features and add sequence motifs as new features. The resulting model, which we call MotifmiRNAPred, was tested on known pre-miRNAs listed in miRBase, and its accuracy was compared with that of existing approaches in the field. With an accuracy of 99.95% for the generalized plant model, it distinguishes itself from previously published results, which reach average accuracies between 74% and 98%. We believe that our approach is useful for predicting pre-miRNAs in plants without per-species adjustment.
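To make the idea of turning a sequence into computable motif features concrete, here is a minimal sketch that maps an RNA sequence to normalized k-mer frequencies. It is only an illustration: the abstract does not specify which motifs or features MotifmiRNAPred uses, and the function name `kmer_features` and the choice of k are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_features(seq: str, k: int = 2, alphabet: str = "ACGU") -> dict:
    """Return the relative frequency of every k-mer over `alphabet` in `seq`.

    Hypothetical feature encoding for illustration; not the published feature set.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    kmers = ("".join(p) for p in product(alphabet, repeat=k))
    # A fixed key order gives a feature vector of constant length (4**k for RNA).
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


if __name__ == "__main__":
    features = kmer_features("GGAUCC", k=2)
    print(len(features), features["GG"])
```

A fixed-length vector of this kind can be passed to any standard classifier alongside structural features.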
Funding: Supported by the High Technology Research and Development Program of China (863 Program, No. 2006AA100309)
Abstract: An open reading frame (lcn61) of lymphocystis disease virus China (LCDV-cn), probably encoding a putative zinc-finger protein, was amplified and inserted into the pET24a (+) vector. It was then expressed in E. coli BL21 (DE3), and a His-tag fusion protein was obtained in high yield. The fusion protein was found to exist in E. coli mainly as inclusion bodies. Bioinformatics analysis indicates that LCN61 is a C2H2-type zinc-finger protein containing four C2H2 zinc-finger motifs. This work provides a theoretical basis for functional research on the lcn61 gene.
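A C2H2 zinc-finger scan of the kind mentioned above can be sketched with a regular expression. The pattern below follows the commonly cited PROSITE-style consensus for C2H2 fingers, C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H; the abstract does not say which tool or pattern the authors used, and the test sequence is synthetic rather than the LCN61 sequence.

```python
import re

# PROSITE-style C2H2 consensus written as a regular expression (an assumption;
# the abstract does not name the pattern or tool actually used).
C2H2_PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def find_c2h2_fingers(protein: str) -> list:
    """Return (start, matched_segment) for each non-overlapping C2H2-like match."""
    protein = protein.upper()
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]


if __name__ == "__main__":
    # Synthetic example: two finger-like segments joined by a short linker.
    finger = "CPECGKSFSRSDHLSRHQRTH"
    demo = finger + "TGEKPYK" + finger
    for start, segment in find_c2h2_fingers(demo):
        print(start, segment)
```

A match only indicates a candidate finger; confirming a functional zinc-finger domain needs profile-based or structural evidence.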
Abstract: The functionality of a gene or a protein depends on the codon repeats occurring in it. Because of their importance to protein function and their apparent involvement in causing diseases, interest in these repeats has grown in recent years. Analyzing genomic and proteomic sequences to identify such repeats requires algorithmic support from informatics. Here, we propose an offline stand-alone toolkit, Repeat Searcher and Motif Detector (RSMD), which employs a few novel approaches to identify sequence repeats and motifs in order to understand their sequence-level functionality and their disease-causing tendency. The tool offers features such as identifying motifs, identifying repeats, and identifying disease-causing repeats. RSMD was designed to provide an easily understandable graphical user interface (GUI), since the tool will be used predominantly by biologists and other researchers across the life sciences. The GUI was developed using the scripting language Perl and its graphical module PerlTK. RSMD covers algorithmic foundations of computational biology by combining theory with practice.
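The basic task of locating in-frame codon repeats can be sketched as follows. This is a generic run-length scan written in Python for illustration; it is not RSMD's algorithm (RSMD is a Perl tool whose methods the abstract does not detail), and the function name and thresholds are hypothetical.

```python
from itertools import groupby


def codon_repeats(seq: str, min_repeats: int = 3, frame: int = 0) -> list:
    """Find runs of an identical codon repeated consecutively in one reading frame.

    Returns (start_position, codon, repeat_count) tuples, with 0-based positions.
    """
    seq = seq.upper()
    # Split into complete codons only; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    runs = []
    index = 0  # index of the first codon of the current run
    for codon, group in groupby(codons):
        count = len(list(group))
        if count >= min_repeats:
            runs.append((frame + 3 * index, codon, count))
        index += count
    return runs


if __name__ == "__main__":
    # Synthetic sequence containing a short CAG run.
    print(codon_repeats("ATGCAGCAGCAGCAGTAA"))
```

Expanded CAG runs are a well-known example of disease-associated codon repeats, which is why a repeat count per codon is a natural output here.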
Funding: Supported in part by the US National Science Foundation (Grant No. IIS-0707571)
Abstract: Non-coding RNA (ncRNA) genes have recently been found to serve many important functions in the cell, such as regulating gene expression at the transcriptional level. There are potentially more ncRNA molecules yet to be found, and their possible functions remain to be revealed. Discovering ncRNAs is difficult because they lack sequence indicators such as the start and stop codons displayed by protein-coding RNAs. Current methods use either sequence motifs or structural parameters to detect novel ncRNAs within genomes. Here, we present an ab initio ncRNA finder, named ncRNAscout, that uses both sequence motifs and structural parameters. Specifically, our method has three components: (i) a measure of the frequency of a sequence, (ii) a measure of the structural stability of a sequence, contained in a t-score, and (iii) a measure of the frequency of certain patterns within a sequence that may indicate the presence of ncRNA. Experimental results show that, given a genome and a set of known ncRNAs, our method is able to accurately identify and locate a significant number of ncRNA sequences in the genome. The ncRNAscout tool is available for download at http://bioinformatics.njit.edu/ncRNAscout.
Funding: Supported by the Higher Education Commission, Pakistan (Grant No. 20-1493/R&D/09)
Abstract: Linking similar proteins structurally is a challenging task that may help in finding novel members of a protein family. In this respect, identifying conserved sequences can facilitate understanding and classifying the exact role of proteins. However, the exact role of these conserved elements cannot be elucidated without structural and physicochemical information. In this work, we present a novel desktop application, MotViz, designed for searching and analyzing conserved sequence segments within protein structures. With MotViz, the user can extract a complete list of sequence motifs from loaded 3D structures, annotate the motifs structurally, and analyze their physicochemical properties. The conservation value calculated for an individual motif can be visualized graphically. To check efficiency, predicted motifs from the data sets of 9 protein families were analyzed, and the MotViz algorithm was more efficient than other online motif prediction tools. Furthermore, a database was integrated for storing, retrieving, and performing detailed functional annotation studies. In summary, MotViz effectively predicts motifs with high sensitivity and simultaneously visualizes them in 3D structures. Moreover, MotViz is user-friendly, with optimized graphical parameters and better processing speed due to the inclusion of a database at the back end. MotViz is available at http://www.fi-pk.com/motviz.html.
Funding: Supported by a research grant from the Department of Information Technology (DIT) awarded to KS
Abstract: The function of a protein molecule is greatly influenced by its three-dimensional (3D) structure, and therefore structure prediction will help identify its biological function. We have updated Sequence, Motif and Structure (SMS), the database of structurally rigid peptide fragments, by combining amino acid sequences and the corresponding 3D atomic coordinates of non-redundant (25%) and redundant (90%) protein chains available in the Protein Data Bank (PDB). SMS 2.0 provides information pertaining to peptide fragments of length 5-14 residues. The entire dataset is divided into three categories, namely, same sequence motifs having similar, intermediate or dissimilar 3D structures. Further, options are provided to facilitate structural superposition using the program structural alignment of multiple proteins (STAMP), and the popular JAVA plug-in (Jmol) is deployed for visualization. In addition, functionalities are provided to search for occurrences of the sequence motifs in other structural and sequence databases such as PDB, Genome Database (GDB), Protein Information Resource (PIR) and Swiss-Prot. The updated database, along with the search engine, is available over the World Wide Web at http://cluster.physics.iisc.ernet.in/sms/.
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 61472205 and 81630103), the US National Science Foundation (Grant Nos. DBI-1262107 and IIS-1646333), China's Youth 1000-Talent Program, and the Beijing Advanced Innovation Center for Structural Biology.
Abstract: Pseudouridine (Ψ) is the most prevalent post-transcriptional RNA modification and is widespread in small cellular RNAs and mRNAs. However, the functions, mechanisms, and precise distribution of Ψs (especially in mRNAs) still remain largely unclear. The landscape of Ψs across the transcriptome has not yet been fully delineated. Here, we present a highly effective model based on a convolutional neural network (CNN), called PseudoUridyLation Site Estimator (PULSE), to analyze large-scale profiling data of Ψ sites and characterize the contextual sequence features of pseudouridylation. PULSE, consisting of two alternately stacked convolution and pooling layers followed by a fully connected neural network, can automatically learn the hidden patterns of pseudouridylation from the local sequence information. Extensive validation tests demonstrated that PULSE can outperform other state-of-the-art prediction methods and achieve high prediction accuracy, thus enabling us to further characterize the transcriptome-wide landscape of Ψ sites. We further showed that the prediction results derived from PULSE can provide novel insights into the functional roles of pseudouridylation, such as the regulation of RNA secondary structure, codon usage, translation, and RNA stability, and the connection to single nucleotide variants. The source code and final model for PULSE are available at https://github.com/mlcb-thu/PULSE.
Funding: Supported by grants from the National Science and Technology Major Project of China (Nos. 2009ZX08010-018B and 2011ZX08007-004) and the State & Shanghai Leading Academic Discipline (B204)
Abstract: The Streptomyces phage φC31 integrase can efficiently target attB-bearing transgenes to endogenous pseudo attP sites within mammalian genomes. To better understand the activity of φC31 integrase in the bovine genome, DNA sequences of 44 integration events were analyzed, and 32 pseudo attP sites were identified. The majority of these sites share a sequence motif that contains inverted repeats and has similarities to the wild-type attP site. Genomic DNA flanking these sites typically contained repetitive sequence elements, such as short and long interspersed repetitive elements. These sequence features indicate that DNA sequence recognition plays an important role in guiding φC31-mediated site-specific integration. In addition, BF27 integration hotspot sites were identified in the bovine genome, which accounted for 13.6% of all isolated integration events and mapped to an intron of the deleted in liver cancer 1 (DLC1) gene. We also found that the pseudo attP sites in the bovine genome had other features in common with those in the human genome. This study represents the first time that the sequence features of pseudo attP sites in the bovine genome were analyzed. We conclude that this site-specific integrase system has great potential for applied modifications of the bovine genome.
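Inverted repeats of the kind described for the shared motif can be located with a simple reverse-complement scan. The sketch below is a generic brute-force search for perfect inverted repeats, not the procedure used in the study; the function names, arm length, and spacer limit are hypothetical, and the demo sequence is synthetic.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, arm_len: int = 4, max_spacer: int = 6) -> list:
    """Find perfect inverted repeats: a left arm followed, after a short spacer,
    by its reverse complement.

    Returns (left_arm_start, left_arm, spacer_length) tuples, 0-based.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm_len + 1):
        left = seq[i:i + arm_len]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            j = i + arm_len + spacer  # start of the candidate right arm
            if j + arm_len > len(seq):
                break
            if seq[j:j + arm_len] == target:
                hits.append((i, left, spacer))
    return hits


if __name__ == "__main__":
    # Synthetic sequence: GACC ... GGTC forms an inverted repeat around TTT.
    print(find_inverted_repeats("AAGACCTTTGGTCAA"))
```

Real pseudo attP sites are imperfect, so a practical search would allow mismatches in the arms rather than requiring exact reverse complements.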