Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60575038), the Natural Science Foundation of Jiangnan University, China (Grant No. 20070365), and the Program for Innovative Research Team of Jiangnan University, China.
Abstract: A new chaos game representation (CGR) of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model has been proposed by Yu et al. (Physica A 337 (2004) 171). In the present paper, a CGR-walk model is proposed based on the new CGR coordinates for the protein sequences from complete genomes. The new CGR coordinates based on the detailed HP model are converted into a time series, and a long-memory ARFIMA(p, d, q) model is introduced into the protein sequence analysis. This model is applied to simulating real CGR-walk sequence data of twelve protein sequences. Remarkably long-range correlations are uncovered in the data, and the results obtained from these models are reasonably consistent with those available from the ARFIMA(p, d, q) model.
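The long-memory part of an ARFIMA(p, d, q) model is the fractional differencing operator (1 - B)^d, which expands into an infinite series of lag weights. As a minimal sketch (the function names `frac_diff_weights` and `frac_diff` and the truncation at the series length are illustrative assumptions, not taken from the paper), the weights can be generated with the standard recursion w_0 = 1, w_k = w_(k-1) (k - 1 - d) / k:

```python
def frac_diff_weights(d, n):
    """Return the first n coefficients of the expansion of (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n):
        # Standard recursion for the binomial-series coefficients.
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply the truncated fractional difference filter to a list of numbers."""
    w = frac_diff_weights(d, len(series))
    # Each output value only uses observations up to and including time t.
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


if __name__ == "__main__":
    print(frac_diff_weights(0.3, 5))
    print(frac_diff([1, 2, 4, 7], 1.0))  # d = 1 reduces to ordinary differencing
```

For 0 < d < 0.5 the weights decay slowly rather than cutting off, which is what lets the model capture long-range correlations in a CGR-walk time series.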
Funding: Project partially supported by the National Natural Science Foundation of China (Grant No. 30570426), the Chinese Program for New Century Excellent Talents in University (Grant No. NCET-08-06867), the Fok Ying Tung Education Foundation (Grant No. 101004), and the Australian Research Council (Grant No. DP0559807).
Abstract: Investigating the biological function of proteins is a key aspect of protein studies, and bioinformatic methods have become important for this purpose. In this paper, we first give the chaos game representation (CGR) of randomly-linked functional protein sequences, then propose the use of recurrent iterated function systems (RIFS) from fractal theory to simulate the measure based on their chaos game representations. This method helps to extract some features of functional protein sequences and, furthermore, of the biological functions of these proteins. Multifractal analysis of the measures based on the CGRs of randomly-linked functional protein sequences is then performed. We find that the CGRs have clear fractal patterns. The numerical results show that the RIFS can simulate the measure based on the CGR very well. The relative standard error and the estimated probability matrix in the RIFS do not depend on the order in which the functional protein sequences are linked. The estimated probability matrices in the RIFS for different biological functions are evidently different. Hence the estimated probability matrices in the RIFS can be used to characterise the difference among linked functional protein sequences with different biological functions. From the values of the Dq curves, one sees that these functional protein sequences are not completely random. The Dq curves of all linked functional proteins studied are multifractal-like and sufficiently smooth for the Cq (analogous to specific heat) curves to be meaningful. Furthermore, the Dq curves of the measure μ based on their CGRs for different linking orders of the functional protein sequences are almost identical if q > 0. Finally, the Cq curves of all linked functional proteins resemble a classical phase transition at a critical point.
Abstract: The novel coronavirus (SARS-CoV-2), generally referred to as the COVID-19 virus, has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification of the virus genomic sequence and identification of its origin for planning, containment, and treatment. Motivated by this need, we report two alignment-free methods combined with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix, then define the distance between two DNA sequences to be the distance between their associated matrices. These methods are used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.
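The abstract does not say which matrix or which distance is used. As one hedged illustration of the general idea, the sketch below associates each DNA sequence with a frequency CGR matrix (k-mer frequencies arranged on the CGR grid) and compares two sequences by the Euclidean distance between their matrices. The corner assignment, the value of k, and the names `fcgr_matrix` and `matrix_distance` are assumptions made for this example only.

```python
import math

# Assumed corner convention for the unit square; other assignments are in use.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr_matrix(seq, k=3):
    """Return a 2^k x 2^k matrix of k-mer frequencies laid out on the CGR grid."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip windows with ambiguous bases such as N
        x = y = 0
        for j, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            # Later bases set higher bits: the last base picks the coarsest quadrant.
            x += cx << j
            y += cy << j
        counts[y][x] += 1
        total += 1
    return [[c / total if total else 0.0 for c in row] for row in counts]


def matrix_distance(a, b):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    return math.sqrt(
        sum((p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    )


if __name__ == "__main__":
    m1 = fcgr_matrix("ATGCGATACGCTTGA")
    m2 = fcgr_matrix("ATGCGATACGCTAGA")
    print(matrix_distance(m1, m2))
```

A pairwise distance table built this way can be passed to any standard clustering or tree-building routine; that step is omitted here.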
Abstract: Comparison between different biological sequences is a key step in bioinformatics when analyzing similarities of sequences and phylogenetic relationships. A method of graphically representing biological sequences known as Chaos Game Representation (CGR) has found many applications in bioinformatics. The key issue in the application of CGR is to extract as many useful features as possible from it. CGR was initially applied to DNA sequences, but in this paper a CGR-based approach is used to extract suitable features for comparing protein sequences of SARS-CoV-2 and other viruses. To this end, several viral protein sequences from 12 groups are considered, and the CGR centroid, amino acid frequency, compounded frequency, Shannon entropy, and Kullback-Leibler discrimination information are applied to find the inter-relationship among the sequences. The experimental results demonstrate the potential strengths of the CGR-based method for examining the evolutionary relationship of protein sequences. Our method is powerful for extracting effective features from protein sequences, and therefore important in classifying proteins and inferring the phylogeny of viruses.
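Two of the features named above, Shannon entropy and Kullback-Leibler discrimination information, can be computed directly from amino acid frequencies. The following is a minimal sketch under stated assumptions: the 20-letter alphabet, the pseudocount used to avoid zero frequencies, and all function names are choices made for this example and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(protein, pseudocount=1.0):
    """Relative amino acid frequencies, smoothed so that no entry is zero."""
    counts = Counter(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    if total == 0:
        raise ValueError("empty sequence and zero pseudocount")
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def shannon_entropy(freqs):
    """Shannon entropy in bits of a frequency dictionary."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must have no zeros."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


if __name__ == "__main__":
    f1 = aa_frequencies("MFVFLVLLPLVSSQCVNLT")
    f2 = aa_frequencies("MKAIIVLLMVVTSNADRIC")
    print(shannon_entropy(f1), kl_divergence(f1, f2))
```

Note that the Kullback-Leibler divergence is not symmetric, so D(p || q) and D(q || p) generally differ; a symmetrised version is often used when a distance-like quantity is needed.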
Funding: Project supported by the Science and Technology Commission of Shanghai Municipality (Grant No. 05DZ19747) and the National Basic Research Program of China (Grant No. 2006CB504509).
Abstract: Chaos game representation (CGR) is proposed as a scale-independent representation for DNA sequences and provides information about the statistical distribution of oligonucleotides in a DNA sequence. CGR images of DNA sequences show fractal patterns, but the common multifractal analysis based on the box-counting method cannot handle CGR images well. Here, the wavelet transform modulus maxima (WTMM) method is applied to the multifractal analysis of CGR images. The results show that the scale-invariance range of CGR edge images can be extended to three orders of magnitude, and complete singularity spectra can be calculated. Spectrum parameters such as the singularity spectrum span are extracted to describe the statistical character of DNA sequences. A comparison of singularity spectrum spans indicates that exon sequences, which have a minimal spectrum span, have the most uniform fractal structure. The singularity spectrum parameters are also related to oligonucleotide length, sequence composition and species, thereby providing a method of studying the length polymorphism of repeat oligonucleotides.
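For readers unfamiliar with how a CGR image is built from a DNA sequence, the basic iteration can be sketched as follows: starting from the centre of the unit square, each nucleotide moves the current point halfway towards that nucleotide's corner. The corner assignment and the function name `cgr_points` are assumptions for this illustration; the WTMM analysis itself is not reproduced here.

```python
# Assumed corner convention for the unit square; other assignments are in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR points visited while reading a DNA sequence."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # ignore ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point to the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ATGCGA"):
        print(point)
```

Plotting these points for a long sequence gives the CGR image; because every point encodes the preceding suffix of the sequence, the density of points in each sub-square reflects the frequency of the corresponding oligonucleotide.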
Funding: Project supported by the Fundamental Research Funds for the Central Universities (Grant No. JUSRP21117) and the Program for Innovative Research Team of Jiangnan University (Grant No. 2008CX002).
Abstract: Over the course of human history, influenza pandemics have been seen as major disasters, so studies on the influenza virus have become an important issue for many experts and scholars. Comprehensive research has been performed over the years on the biological properties, chemical characteristics, external environmental factors and other aspects of the virus, and some results have been achieved. Based on the chaos game representation (CGR) walk model, this paper uses time series analysis to study the DNA sequences of the influenza virus from 1913 to 2010, and works out an early-warning indicator value for the outbreak of an influenza pandemic. The variances of the CGR-walk sequences for the pandemic years (or within ±1 to 2 years of them) are significantly higher than those for the adjacent years, while those in the non-pandemic years are usually smaller. In this way we can provide an influenza early-warning mechanism so that people can take precautions and be well prepared prior to a pandemic.
Abstract: Knowledge of the evolution of pathogens is of great medical and biological significance to the prevention, diagnosis, and therapy of infectious diseases. In order to understand the origin and evolution of SARS-CoV (severe acute respiratory syndrome-associated coronavirus), we collected complete genome sequences of all viruses available in GenBank and made comparative analyses with SARS-CoV. Genomic signature analysis demonstrates that all coronaviruses except SARS-CoV have TGTT as their most abundant tetranucleotide. A detailed analysis of the forty-two complete SARS-CoV genome sequences revealed the existence of two distinct genotypes, and showed that these isolates could be classified into four groups. Our manual analysis of the BLASTN results demonstrates that the HE (hemagglutinin-esterase) gene exists in SARS-CoV, although many mutations have made it unfamiliar to us.
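The tetranucleotide part of a genomic signature analysis comes down to counting overlapping 4-mers and finding the most abundant one. The sketch below is only an illustration of that counting step: it counts on a single strand, skips windows with ambiguous bases, and uses the hypothetical function name `richest_kmer`; the paper's exact procedure may differ.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def richest_kmer(seq, k=4):
    """Return (k-mer, count) for the most frequent overlapping k-mer, or None."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES  # skip windows with N and the like
    )
    return counts.most_common(1)[0] if counts else None


if __name__ == "__main__":
    # Toy sequence for demonstration only, not real genome data.
    print(richest_kmer("TGTTTGTTACGT"))
```

On a real genome the same function would be applied to the complete sequence read from a FASTA file; ties between equally frequent tetranucleotides are resolved here by first occurrence.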