Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60575038), the Natural Science Foundation of Jiangnan University, China (Grant No. 20070365), and the Program for Innovative Research Team of Jiangnan University, China.
Abstract: A new chaos game representation (CGR) of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model was proposed by Yu et al. (Physica A 337 (2004) 171). In the present paper, a CGR-walk model is proposed based on the new CGR coordinates for protein sequences from complete genomes. The new CGR coordinates based on the detailed HP model are converted into a time series, and a long-memory ARFIMA(p, d, q) model is introduced into protein sequence analysis. The model is applied to simulating real CGR-walk sequence data of twelve protein sequences. Remarkably long-range correlations are uncovered in the data, and the results obtained from these models are reasonably consistent with those available from the ARFIMA(p, d, q) model.
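The long-memory part of an ARFIMA(p, d, q) model is the fractional differencing operator (1 - B)^d, whose expansion in powers of the backshift operator B has weights w_0 = 1 and w_k = w_(k-1) (k - 1 - d) / k. The following is a minimal sketch of that operator only, not the fitting procedure used in the paper; the function names `frac_diff_weights` and `frac_diff` are hypothetical, and the series is truncated at its first observation.

```python
def frac_diff_weights(d, n):
    """First n weights of the expansion of (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n):
        # Standard recursion for the binomial series of (1 - B)^d.
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - B)^d to a finite series, truncating at the first value."""
    weights = frac_diff_weights(d, len(series))
    result = []
    for t in range(len(series)):
        # y_t = sum_{k=0}^{t} w_k * x_{t-k}
        result.append(sum(weights[k] * series[t - k] for k in range(t + 1)))
    return result
```

With d = 1 the weights reduce to 1, -1, 0, 0, ..., so `frac_diff` becomes ordinary first differencing; values of d between 0 and 0.5 give the slowly decaying weights associated with long-range correlation.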
Funding: Project supported by the Science and Technology Commission of Shanghai Municipality (Grant No. 05DZ19747) and the National Basic Research Program of China (Grant No. 2006CB504509).
Abstract: Chaos game representation (CGR) is proposed as a scale-independent representation for DNA sequences and provides information about the statistical distribution of oligonucleotides in a DNA sequence. CGR images of DNA sequences show fractal patterns, but the common multifractal analysis based on the box-counting method cannot handle CGR images well. Here, the wavelet transform modulus maxima (WTMM) method is applied to the multifractal analysis of CGR images. The results show that the scale-invariance range of CGR edge images can be extended to three orders of magnitude, and complete singularity spectra can be calculated. Spectrum parameters such as the singularity spectrum span are extracted to describe the statistical character of DNA sequences. A comparison of singularity spectrum spans shows that exon sequences, which have the smallest span, have the most uniform fractal structure. The singularity spectrum parameters are also related to oligonucleotide length, sequence composition and species, thereby providing a method of studying the length polymorphism of repeat oligonucleotides.
Funding: Project partially supported by the National Natural Science Foundation of China (Grant No. 30570426), the Chinese Program for New Century Excellent Talents in University (Grant No. NCET-08-06867), the Fok Ying Tung Education Foundation (Grant No. 101004), and the Australian Research Council (Grant No. DP0559807).
Abstract: Investigating the biological function of proteins is a key aspect of protein studies, and bioinformatic methods have become important for this purpose. In this paper, we first give the chaos game representation (CGR) of randomly linked functional protein sequences, then propose the use of recurrent iterated function systems (RIFS) from fractal theory to simulate the measure based on their CGRs. This method helps to extract some features of functional protein sequences and, furthermore, the biological functions of these proteins. Multifractal analysis of the measures based on the CGRs of randomly linked functional protein sequences is then performed. We find that the CGRs have clear fractal patterns. The numerical results show that the RIFS can simulate the measure based on the CGR very well. The relative standard error and the estimated probability matrix in the RIFS do not depend on the order in which the functional protein sequences are linked. The estimated probability matrices in the RIFS for different biological functions are evidently different. Hence the estimated probability matrices in the RIFS can be used to characterise the difference among linked functional protein sequences with different biological functions. From the values of the Dq curves, one sees that these functional protein sequences are not completely random. The Dq curves of all linked functional proteins studied are multifractal-like and sufficiently smooth for the Cq (analogous to specific heat) curves to be meaningful. Furthermore, the Dq curves of the measure μ based on their CGRs for different linking orders of the functional protein sequences are almost identical if q > 0. Finally, the Cq curves of all linked functional proteins resemble a classical phase transition at a critical point.
Abstract: We developed a new approach for the reconstruction of phylogenetic trees based on the chaos game representation (CGR) of biological sequences. The CGR method generates a picture from a biological sequence, which displays both local and global patterns. A quantitative index of the biological sequence is extracted from the picture. The Kullback-Leibler discrimination information is used as a diversity indicator to measure the dissimilarity of each pair of biological sequences. The new method is tested on two data sets: the Eutherian orders, using concatenated H-stranded amino acid sequences, and the genome sequences of the SARS virus and other coronaviruses. The phylogenetic trees constructed by the new method are consistent with the commonly accepted ones. These results are very promising and suggest that further development is worthwhile.
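The Kullback-Leibler discrimination information between two discrete distributions p and q is the sum of p_i log(p_i / q_i). The sketch below illustrates how it could serve as a pairwise dissimilarity; how the paper derives the distributions from the CGR picture is not stated in the abstract, so the inputs here are simply assumed to be two probability vectors of equal length, and the symmetrised form `symmetric_kl` is an illustrative choice rather than the authors' definition.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler discrimination information of p relative to q.

    Assumes p and q are probability vectors of equal length and that
    q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised version, usable as a pairwise dissimilarity (assumption)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

A matrix of such pairwise values could then be passed to any distance-based tree-building method; identical distributions give a value of zero.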
Abstract: Chaos game representation (CGR) of DNA sequences and linked protein sequences from genomes was proposed by Jeffrey (1990) and Yu et al. (2004), respectively. In this paper, we consider the CGR of three kinds of sequences from complete genomes: whole genome DNA sequences, linked coding DNA sequences and linked protein sequences. Some fractal patterns are found in these CGRs. A recurrent iterated function systems (RIFS) model is proposed to simulate the CGRs of these sequences from genomes and their induced measures. Numerical results on 50 genomes show that the RIFS model can simulate the CGRs and their induced measures very well. The parameters estimated in the RIFS model reflect information on species classification.
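For readers unfamiliar with the construction, the DNA chaos game can be sketched as follows: each nucleotide is assigned a corner of the unit square, and each successive point lies halfway between the previous point and the corner of the current nucleotide. The corner assignment and the starting point at the centre of the square are one common convention and are assumptions here, as is the function name `cgr_points`.

```python
# Corner assignment: one common convention, assumed for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points for a DNA sequence."""
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Plotting the returned points for a whole genome gives the CGR picture; counting points in a regular grid over the unit square gives an empirical version of the induced measure that models such as RIFS aim to reproduce.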
Abstract: The novel coronavirus (SARS-CoV-2), generally referred to as the COVID-19 virus, has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification of the virus genomic sequence and identification of its origin, for planning, containment, and treatment. Motivated by this need, we report two alignment-free methods combined with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix, then define the distance between two DNA sequences to be the distance between their associated matrices. These methods are used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.
Abstract: Comparison between different biological sequences is a key step in bioinformatics when analyzing sequence similarities and phylogenetic relationships. A method of graphically representing biological sequences known as Chaos Game Representation (CGR) has found many applications in bioinformatics. The key issue in applying CGR is to extract as many useful features as possible from it. CGR was initially applied to DNA sequences, but in this paper a CGR-based approach is used to extract suitable features for comparing protein sequences of SARS-CoV-2 and other viruses. To this end, several viral protein sequences from 12 groups are considered, and CGR centroid, amino acid frequency, compounded frequency, Shannon entropy, and Kullback-Leibler discrimination information are applied to find the inter-relationship among the sequences. The experimental results demonstrate the potential strengths of the CGR-based method for examining the evolutionary relationship of protein sequences. Our method is powerful for extracting effective features from protein sequences, and is therefore important in classifying proteins and inferring the phylogeny of viruses.
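Two of the features named above, amino acid frequency and Shannon entropy, are simple to illustrate. The sketch below computes residue frequencies of a protein sequence and the Shannon entropy of that frequency distribution in bits; the abstract does not say which distribution the entropy is taken over, so applying it to single-residue frequencies is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter


def residue_frequencies(sequence):
    """Relative frequency of each residue in a non-empty sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {residue: n / total for residue, n in counts.items()}


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue frequency distribution."""
    freqs = residue_frequencies(sequence)
    return -sum(p * math.log2(p) for p in freqs.values())
```

A sequence made of a single repeated residue has entropy 0, and a sequence using all 20 standard amino acids equally often would reach the maximum of log2(20) bits.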
Funding: Supported by National Natural Science Grant No. 60575038, Jiangnan University Grant No. 20070365, and the Program for Innovative Research Team of Jiangnan University.
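Abstract: Using the chaos game representation (CGR) of DNA sequences, a method is proposed for converting a two-dimensional DNA map into a corresponding spectrum-like format. The method not only provides a good visual representation but also converts a DNA sequence into a time series. Using the CGR coordinates, DNA sequences are converted into CGR radian sequences, and a long-memory ARFIMA(p, d, q) model is introduced to fit such sequences. Significant long-range correlations are found in these sequences, and the model fits them well.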
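Abstract: Using the chaos game representation (CGR) of protein sequences based on the classical HP model, the CGR of the protein sequence of the RHD gene is given. It can be regarded as a characteristic map describing the secondary structure of the protein sequence and has some reference value for clinical blood-group identification. In addition, following the CGR method for depicting DNA sequences proposed by Jeffrey in 1990, the CGR of the DNA sequence of the RHD gene is given, and the corresponding two-step Markov transition probability matrix of the RHD gene is computed from it. The probability matrix shows the RHD gene's usage preference for the third base of the triplets that encode amino acids.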
Funding: Project supported by the Fundamental Research Funds for the Central Universities (Grant No. JUSRP21117) and the Program for Innovative Research Team of Jiangnan University (Grant No. 2008CX002).
Abstract: Over the course of human history, influenza pandemics have been major disasters, so the study of the influenza virus has become an important issue for many experts and scholars. Comprehensive research has been performed over the years on the biological properties, chemical characteristics, external environmental factors and other aspects of the virus, and some results have been achieved. Based on the chaos game representation (CGR) walk model, this paper uses time series analysis to study the DNA sequences of the influenza virus from 1913 to 2010, and derives an early-warning indicator value for the outbreak of an influenza pandemic. The variances of the CGR-walk sequences for the pandemic years (or within ±1 to 2 years of them) are significantly higher than those for the adjacent years, while those in the non-pandemic years are usually smaller. In this way we can provide an influenza early-warning mechanism so that people can take precautions and be well prepared prior to a pandemic.
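Abstract: Influenza viruses are divided into three types: A, B, and C. Of these, influenza A virus is the most lethal and causes severe disease in humans. This paper establishes a new time series model for the DNA sequences of influenza A virus, namely the CGR (chaos game representation) radian sequence. Using the CGR coordinates, the DNA sequences of influenza A virus are converted into CGR radian sequences, and a long-memory ARFIMA model is introduced to fit them. Ten randomly selected H1N1 sequences and ten H3N2 sequences all show long-range correlations and are fitted well. The two kinds of sequences can also tentatively be identified by different ARFIMA models: H1N1 by ARFIMA(0, d, 5) and H3N2 by ARFIMA(1, d, 1).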