In this paper, we report a multiple sequence alignment of 10 amino acid sequences of the M protein from different coronaviruses (4 SARS-associated and 6 other known coronaviruses). The alignment model was based on the profile HMM (Hidden Markov Model), and model training was implemented with the SAHMM (Self-Adapting Hidden Markov Model) software developed by the authors.
The task of clustering Web sessions is to group Web sessions by similarity, maximizing the intra-group similarity while minimizing the inter-group similarity. The first and foremost question in clustering Web sessions is how to measure the similarity between them. Traditional measurements, however, have many shortcomings. This paper introduces a new method for measuring similarity between Web pages that takes into account not only the URL but also the viewing time of the visited page. We then give a new method for measuring the similarity of Web sessions in detail, using sequence alignment and the similarity of Web page accesses. Experiments show that our method is valid and efficient.
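The idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives no formulas, so the URL measure (shared leading path segments), the viewing-time weighting (ratio of shorter to longer time), the gap penalty, and every function name here are hypothetical choices. The alignment itself is a standard Needleman-Wunsch global alignment.

```python
from typing import List, Tuple

Visit = Tuple[str, float]  # (URL path, viewing time in seconds)


def url_similarity(a: str, b: str) -> float:
    """Fraction of leading path segments shared by two URL paths (assumed measure)."""
    pa = [s for s in a.strip("/").split("/") if s]
    pb = [s for s in b.strip("/").split("/") if s]
    if not pa and not pb:
        return 1.0
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def page_similarity(v: Visit, w: Visit) -> float:
    """URL similarity scaled by how close the two viewing times are (assumed weighting)."""
    (url_a, t_a), (url_b, t_b) = v, w
    longest = max(t_a, t_b)
    time_ratio = 1.0 if longest == 0 else min(t_a, t_b) / longest
    return url_similarity(url_a, url_b) * time_ratio


def session_similarity(s1: List[Visit], s2: List[Visit], gap: float = -0.5) -> float:
    """Global alignment score of two sessions, normalized to the range [0, 1]."""
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return 0.0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + page_similarity(s1[i - 1], s2[j - 1]),
                score[i - 1][j] + gap,  # page in s1 aligned to a gap
                score[i][j - 1] + gap,  # page in s2 aligned to a gap
            )
    # Negative totals are clamped to zero; the maximum possible score is max(n, m).
    return max(0.0, score[n][m]) / max(n, m)
```

The resulting pairwise scores could then feed any clustering method that accepts a similarity matrix.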
In this letter, we briefly describe a self-adapting hidden Markov model (SA-HMM) program and its application to multiple sequence alignment. The program consists of a two-stage optimisation algorithm.
Aligning many protein or DNA sequences for bioinformatics applications is very complex. Existing Multiple Sequence Alignment (MSA) techniques trade off their objectives: techniques that focus on speed sacrifice accuracy, whereas techniques that focus on accuracy sacrifice speed. Alignment here means identifying the similarity between different sequences with high accuracy. As the number of sequences grows, the problem becomes increasingly complex. Because of the emergence and rapid development of gene sequencing, and the growing dependence on it, sequence alignment has become important in every analysis of biological relationships. Counting the similar amino acids is the primary method for showing that two sequences are related. Time is a main issue in any alignment technique. In this paper, a more effective MSA method is proposed for handling massive multiple protein sequence alignments, maintaining the highest accuracy with less time consumption. The proposed method depends on the Artificial Fish Swarm (AFS) algorithm, which can overcome most of the challenges of MSA problems. AFS is exploited to obtain high accuracy in adequate time. AFS has become increasingly popular in applications such as artificial intelligence, computer vision, machine learning, and data-intensive applications. It mimics the behavior of fish seeking food in nature. The AFS mechanisms of preying, swarming, following, moving, and leaping help increase accuracy and reduce execution time. The sense organs that allow the artificial fish to collect information and vision from the environment also contribute to accuracy. These features make the alignment operation more efficient and especially suitable for large-scale data. The implementation and experimental results place the proposed AFS as a first choice for alignment compared with well-known multiple sequence alignment algorithms.
Eukaryotic genomes contain a significant fraction of repeats, which have very important biomedical functions. Aligning repeats from short sequences back to the reference genome is therefore a key step for further genome analysis. Unfortunately, current alignment algorithms perform poorly in distinguishing repeats from non-repeats. To address this problem, we propose a new algorithm named HashRepAligner. A cross comparison with other algorithms was performed, and the results indicate that HashRepAligner outperforms other aligners in detecting repeats.
Soybean mosaic virus (SMV), a member of the genus Potyvirus, is a major pathogen of soybean plants in China, and 16 SMV strains have been identified nationwide based on a former detailed SMV classification system. As the P3 gene is thought to be involved in viral replication, systemic infection, pathogenicity, and overcoming resistance, knowledge of the P3 gene sequences of SMV and other potyviruses would be useful for understanding the genetic relationships among them and for controlling the disease. P3 gene sequences were obtained from representative isolates of these 16 SMV strains and were compared with other SMV strains and 16 Potyvirus species from the National Center for Biotechnology Information (NCBI) GenBank database. The P3 genes from the 16 SMV isolates are composed of 1041 nucleotides, encoding 347 amino acids, and share 90.7-100% nucleotide (NT) sequence identities and 95.1-100% amino acid (AA) sequence identities. The P3 coding regions of the 16 SMV isolates share high identities (92.4-98.9% NT and 96.0-100% AA) with the reported Korean isolates, followed by the USA isolates (88.5-97.9% NT and 91.4-98.6% AA), and share low identities (80.5-85.2% NT and 82.1-84.7% AA) with the reported HZ 1 and P isolates from Pinellia ternata. The sequence identities of the P3 genes between SMV and the 16 potyviruses varied from 44.4 to 81.9% in the NT sequences and from 21.4 to 85.3% in the AA sequences. Among them, SMV was most closely related to Watermelon mosaic virus (WMV), with 76.0-81.9% NT and 77.5-85.3% AA identities. In addition, the SMV isolates and potyvirus species were clustered into six distinct groups. All the SMV strains isolated from soybean were clustered in Group I, and the remaining species were clustered in other groups. A multiple sequence alignment analysis of the C-terminal regions indicated that the P3 genes within a species were highly conserved, whereas those among species were relatively variable.
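The identity percentages quoted in this abstract are the kind of figure that can be computed from a pairwise alignment. The sketch below is an assumption-laden illustration, not the authors' procedure: it uses one common convention (matches divided by the number of columns in which neither sequence has a gap), and the function name and example strings are hypothetical. It also checks the stated gene length against the stated protein length.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences, ignoring columns that contain a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip gapped columns under this convention
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Consistency check on the abstract's numbers: 1041 nucleotides read as
# three-nucleotide codons correspond to 347 amino acids.
P3_NUCLEOTIDES = 1041
P3_CODONS, LEFTOVER = divmod(P3_NUCLEOTIDES, 3)
```

Other conventions divide by the full alignment length or by the shorter sequence length, so reported identities are only comparable when the same denominator is used.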
Owing to current advances in technology, molecular databases have grown exponentially, calling for faster and more efficient methods that can handle these huge amounts of data. Multi-processing CPU technology, including physical and logical processors (Hyper-Threading), can therefore be used to significantly increase computational performance. Sequence comparison and pairwise alignment both contribute significantly to calculating the resemblance between sequences for constructing optimal alignments. This research used the Hash Table-NGram-Hirschberg (HT-NGH) algorithm to represent this pairwise alignment, utilizing hashing capabilities. The authors propose using a parallel shared-memory architecture via Hyper-Threading to improve the performance of pairwise alignment on molecular protein datasets. The proposed parallel hyper-threading method transforms HT-NGH by decomposing the datasets at the sequence level for efficient utilization of the processing units, that is, reducing idle processing units. The authors combined hyper-threading with multicore processing on shared memory, reporting an average speedup of 24.8% and a highest rate of 34.4%. This improvement is achieved while preserving acceptable accuracy, reaching speedups of 2.08, 2.88, and 3.87 and efficiencies of 1.04, 0.96, and 0.97 using 2, 3, and 4 cores, respectively.
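The speedup and efficiency figures in this abstract are related by the usual definition of parallel efficiency, speedup divided by the number of cores. The short check below assumes that standard definition (the abstract does not state it) and uses hypothetical names; the numbers are the ones reported above.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency under the standard definition: speedup / number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# cores -> (reported speedup, reported efficiency), as quoted in the abstract
REPORTED = {2: (2.08, 1.04), 3: (2.88, 0.96), 4: (3.87, 0.97)}

# Difference between recomputed and reported efficiency for each core count
DEVIATIONS = {
    cores: abs(parallel_efficiency(speedup, cores) - efficiency)
    for cores, (speedup, efficiency) in REPORTED.items()
}
```

Under this definition the reported values agree to within rounding at two decimal places. An efficiency above 1 at two cores corresponds to a slightly superlinear speedup.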
Existing studies have challenged the current definition of named bacterial species, especially in the case of highly recombinogenic bacteria. This has led to the use of computational procedures to examine potential bacterial clusters that are not identified by species naming. This paper describes the use of sequence data obtained from MLST databases as input for a k-means algorithm extended to use housekeeping gene sequences as the similarity metric for the clustering process. An implementation of the k-means algorithm has been developed from an existing source code implementation, and it has been evaluated against MLST data. The results point to potential bacterial clusters that are close to more than one named species and thus may become candidates for alternative classifications that account for genotypic information. The use of hierarchical clustering with sequence comparison as the similarity metric has the potential to find clusters different from named species by using a more informed cluster formation strategy than a conventional nominal variant of the algorithm.
Creating a multi-gene alignment matrix for phylogenetic analysis using organelle genomes involves aligning single-gene datasets manually, a process that can be time-consuming and prone to errors. The HomBlocks pipeline was created to eliminate the inaccuracies arising from manual operations. Processing a large number of sequences, however, remains a time-consuming task. To overcome this challenge, we developed a fast and efficient method called Organelle Genomes for Phylogenetic Analysis (ORPA). ORPA can quickly generate multiple sequence alignments for whole-genome comparisons by parsing the result files of NCBI BLAST, completing the task in just 1 min. With increasing data volume, the efficiency of ORPA is even more pronounced: it is over 300 times faster than HomBlocks in aligning 60 higher-plant chloroplast genomes. The phylogenetic tree outputs from ORPA are equivalent to those from HomBlocks, indicating its outstanding efficiency. Owing to its speed and accuracy, ORPA can identify species-level evolutionary conflicts, providing valuable insights into evolutionary understanding.
[Objective] The aim was to identify genetic variation in Citrus sinensis (sweet orange) germplasm from Hunan Province using Start Codon Targeted (SCoT) polymorphism. [Method] The reaction system for SCoT amplification from sweet orange was first optimized, and then the SCoT fragments were amplified from 24 sweet orange cultivars collected in Hunan Province and sequenced for genetic variation analysis. [Result] The optimum reaction system for SCoT marker amplification was 2.0 μl containing 80 ng of template DNA, 0.3 mmol/L dNTPs, 0.2 μmol/L primer, 1.6 mmol/L Mg2+, 1.6 U of Taq DNA polymerase and 10×PCR buffer. With this reaction system, the PCR products from the sweet orange cultivars produced clear and reproducible bands at 100-2,000 bp on electrophoresis. The SCoT fragments of the 24 sweet orange cultivars were 1,090-1,091 bp long, with a homology of 99.84% and with nucleotide deletions and substitutions. After sequencing, the SCoT polymorphisms could distinguish 12 sweet orange cultivars. In addition, the BLAST result showed that part of the coding region of the SCoT fragments shared high homology with the ribosomal protein S3 N superfamily. [Conclusion] This study will provide a theoretical basis for breeding sweet orange cultivars.
[Objective] The molecular weight, isoelectric point, signal peptide, domains and other properties of the proteins encoded by the known cystatin genes were analyzed. [Method] Cystatin genes were searched in NCBI and the related amino acid sequences were downloaded. SMART software was used to predict the domains. The SignalP program was used to search for signal peptides. The TMHMM program was used to search for and predict transmembrane domains. The CLUSTAL W program was used for multiple sequence alignment. Using MEGA3.1 software,...
AIM: To isolate a novel isoform of human HPO (HPO-205) from human fetal liver Marathon-ready cDNA and characterize its primary biological function. METHODS: 5'-RACE (rapid amplification of cDNA 5' ends) was used to isolate a novel isoform of hHPO. The constructed pcDNA(HPO-205), pcDNA(HPO) and pcDNA eukaryotic expression vectors were each transfected by the lipofectamine method, and the stimulation of DNA synthesis was observed by (3)H-TdR incorporation assay. Proteins extracted from different cells were analyzed by Western blot. RESULTS: A novel isoform of hHPO (HPO-205), encoding a 205 amino acid ORF corresponding to a translated product of 23 kDa, was isolated and distinguished from the previously described HPO, which lacks the N-terminal 80 amino acids. The dose-dependent stimulation of DNA synthesis in HepG2 hepatoma cells by HPO-205 demonstrated that its biological activity in vitro is similar to that of HPO. The level of MAPK (mitogen-activated protein kinase) phosphorylation, measured by Western blot analysis, revealed that HPO-205 might stimulate hepatic cell proliferation more strongly than HPO. CONCLUSION: A novel isoform of hHPO (HPO-205) was isolated from hepatic-derived cells. The comparison of HPO-205 and HPO will lead to new insight into the structure and function of hHPO and provide a new way of thinking for elucidating the biological roles of HPO/ALR.
Profile hidden Markov models (HMMs) based on classical HMMs have been widely applied to protein sequence identification. The forward and backward variables in profile HMMs are formulated under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thus further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since fuzzy sets can handle uncertainties better than classical methods.
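For orientation, the classical forward variable that this abstract proposes to fuzzify can be sketched as below. This is the standard probabilistic forward recursion for a discrete HMM, not the fuzzy variant with Sugeno measures and Choquet integrals described above; the two-state toy model, its probabilities, and all names are hypothetical.

```python
from itertools import product


def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Probability of an observation sequence under a discrete HMM (forward algorithm)."""
    if len(obs) == 0:
        raise ValueError("observation sequence must not be empty")
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: a match-like state "M" and an insert-like state "I"
# emitting symbols from a two-letter alphabet.
STATES = ("M", "I")
START = {"M": 0.8, "I": 0.2}
TRANS = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.6, "I": 0.4}}
EMIT = {"M": {"A": 0.7, "G": 0.3}, "I": {"A": 0.5, "G": 0.5}}

# Sanity check: the likelihoods of all length-3 sequences sum to 1.
TOTAL = sum(
    forward_likelihood(seq, STATES, START, TRANS, EMIT)
    for seq in product("AG", repeat=3)
)
```

The sums and products in the recursion are where the independence assumption enters, which is the part the abstract says it replaces with fuzzy measures.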
We analyze for the first time the rules of breaking in an X-palindrome between human and chimpanzee. The results indicate that although fewer overall changes occurred in the human X-palindrome than in the chimpanzee, the mutations occurring between the left arm and the right arm were nearly equivalent in human and chimpanzee when compared with orangutan, which implies evolutionary synchronization. However, there are many more A/T→G/C changes than G/C→A/T changes in a single arm, which would lead to an increasing trend in GC content and suggests that the composition is not at equilibrium. In addition, it is remarkable that there are many more asymmetrical nucleotide changes between the two arms of the human palindrome than between those of the chimpanzee palindrome, and these mutations are prone to occur between bases with similar chemical structures. The symmetry seems higher in the chimpanzee palindrome than in the human X-palindrome.
In 2009, an emerging citrus viral disease caused by Citrus chlorotic dwarf-associated virus (CCDaV) was discovered in Yunnan Province of China. However, the occurrence and spread of CCDaV in other citrus-growing provinces in China has been unknown to date. To better understand the distribution and molecular diversity of CCDaV in China, a total of 1,772 citrus samples were collected from 11 major citrus-growing provinces and tested for CCDaV by PCR. Among these, 134 citrus samples from Guangxi, Yunnan and Guangdong tested positive for CCDaV, demonstrating that the occurrence and spread of CCDaV are increasing in China. The complete genomes of 17 CCDaV isolates from different provinces and hosts were sequenced. Comparisons of the whole-genome sequences of the 17 CCDaV isolates and the 15 isolates available in GenBank revealed a sequence identity of about 99–100%, showing that the CCDaV isolates were highly conserved. Phylogenetic studies showed that the 32 CCDaV isolates belonged to four different groups based on geographical origin and host species, and that CCDaV isolates from China and Turkey were clustered into different groups. The results provide important information for clarifying the distribution and genetic diversity of CCDaV in China.
MegaBlast is one of the most important programs in the NCBI BLAST (Basic Local Alignment Search Tool) toolkit. However, MegaBlast is computation and I/O intensive. It consumes a great deal of memory, proportional to the product of the sizes of the query sequence set and the subject (database) sequence set. This paper proposes a new strategy for optimizing MegaBlast. The new strategy exchanges the query and subject sequence sets and builds a hash table based on the new subject sequences. It overlaps I/O with computation, shortens the overall time and reduces the memory cost, since the memory is then proportional only to the size of the subject sequence set. The optimized algorithm is suitable for parallelization on cluster systems. The parallel algorithm uses the query segmentation method. As our experiments show, the parallel program, implemented with MPI, scales well.
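The hash-table step mentioned in this abstract can be illustrated with a generic k-mer index. This is a minimal sketch of the general seeding idea only, not MegaBlast's actual data structure or the authors' optimized version; the function names and the word size used in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(subject: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the subject sequence to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, index: Dict[str, List[int]], k: int) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-mer of the query occurs in the subject."""
    hits: List[Tuple[int, int]] = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict during lookup
        for s in index.get(query[q:q + k], ()):
            hits.append((q, s))
    return hits
```

Because the table is built from one sequence set only, its size grows with that set alone, while the other set can be streamed through `find_seeds` one sequence at a time. That matches the memory argument made in the abstract, at least in this simplified form.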
A fundamental goal in cellular signaling is to understand allosteric communication, the process by which signals originating at one site in a protein propagate reliably to affect distant functional sites. The general principles of protein structure that underlie this process remain unknown. Statistical coupling analysis (SCA) is a statistical technique that uses evolutionary data from a protein family to measure correlation between distant functional sites and to suggest allosteric communication. In proteins, very distant and small interactions between collections of amino acids provide the communication, which can be important for the signaling process. In this paper, we present the SCA of a protein alignment of the esterase family (Pfam ID: PF00756), which contains the sequence of antigen 85C secreted by Mycobacterium tuberculosis, to identify a subset of interacting residues. Clustering analysis of the pairwise correlations highlighted seven important residue positions in the esterase family alignments. These residues were then mapped onto the crystal structure of antigen 85C (PDB ID: 1DQZ). The mapping revealed correlation between 3 distant residues (Asp38, Leu123 and Met125) and suggests allosteric communication between them. This information could inform the development of a new drug against this fatal disease.
MCM10 protein is an essential replication factor involved in the initiation of DNA replication. An mcm10 mutant (mcm10-1) of budding yeast shows growth arrest at 37 °C. In the present work, we isolated an mcm10-1 suppressor strain that grows at 37 °C. Interestingly, this mcm10-1 suppressor undergoes cell cycle arrest at 14 °C. A novel gene, YLR003c, was identified by high-copy complementation of this suppressor. We named it Cms1 (Complementation of Mcm10 Suppressor). Furthermore, transformation experiments show that cells of the mcm10-1 suppressor carrying a high-copy plasmid, but not a low-copy plasmid, grow at 14 °C, indicating that overexpression of Cms1 can rescue the growth arrest of this mcm10 suppressor at the non-permissive temperature. These results suggest that CMS1 protein may functionally interact with MCM10 protein and play a role in the regulation of DNA replication and cell cycle control.
Advancements in next-generation sequencer (NGS) platforms have improved NGS sequence data production and reduced the cost involved, which has resulted in the production of a large amount of genome data. The downstream analysis of multiple associated sequences has become a bottleneck for the growing genomic data because of storage and space utilization issues in bioinformatics. Traditional string-matching algorithms are efficient for small data sequences and cannot process large amounts of data for downstream analysis. This study proposes a novel bit-parallelism algorithm called BitmapAligner to overcome the issues caused by a large number of sequences and to improve the speed and quality of multiple sequence alignment (MSA). The input files (sequences) tested with BitmapAligner can be easily managed and organized using the Hadoop distributed file system. The proposed aligner converts the test file (the whole genome sequence) into binaries of a length equal to that of the sequence, line by line, before the sequence alignment processing. The Hadoop distributed file system splits larger files into blocks based on a defined block size, which is 128 MB by default. BitmapAligner can accurately process the sequence alignment using the bitmask approach on large-scale sequences after sorting the data. The experimental results indicate that BitmapAligner operates in real time with a large number of sequences. Moreover, BitmapAligner finds the exact start and end positions of the pattern sequence when the MSA application is tested on the whole-genome query sequence. The accuracy of the MSA is verified by the bitmask indexing property of the bit-parallelism extended shifts (BXS) algorithm. The dynamic and exact approach of the BXS algorithm is implemented through the MapReduce function of Apache Hadoop. Conversely, the traditional seed-and-extend approach risks errors when identifying the positions of the pattern sequences. Moreover, the proposed model resolves the large-scale data challenges that are covered through MapReduce in the Hadoop framework. Hive, Yarn, HBase, Cassandra, and many other pertinent flavors are to be used in the future for data structuring and annotations on the top layer of Hadoop, since Hadoop is primarily used for data organization and handles text documents.
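To make the bit-parallel idea concrete, the sketch below uses the classic Shift-And exact matcher. It is deliberately a simpler, well-known relative and not the BXS algorithm or BitmapAligner itself, whose details the abstract does not give; the function name is hypothetical. Like the approach described above, it reports the exact start and end positions of each occurrence of the pattern.

```python
from typing import Dict, List, Tuple


def shift_and_search(text: str, pattern: str) -> List[Tuple[int, int]]:
    """Find all exact occurrences of pattern in text with the Shift-And algorithm.

    Returns (start, end) index pairs, both inclusive.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # Bit i of masks[ch] is set when pattern[i] == ch.
    masks: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that marks a full match
    state = 0              # bit i set: pattern[0..i] matches the text ending here
    matches: List[Tuple[int, int]] = []
    for end, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            matches.append((end - m + 1, end))
    return matches
```

Each text character costs one shift, one OR and one AND regardless of how many partial matches are active, which is what makes bitmask approaches attractive for long sequences. In Python the integers are arbitrary precision, so the pattern length is not limited to a machine word.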
There are many web-based multiple sequence alignment services accessible around the world. However, many researchers working on biological sequence analysis still struggle with multiple sequence alignment software that is inefficient, has an unfriendly user interface, and offers limited capability. In this study, we provide a comprehensive survey of regional and continental facilities that provide web-based alignment services. We also analyze and identify much-needed services that are not available through these existing providers. We then implement a web-based model to address these needs. From that perspective, our web-based multiple sequence alignment server, SeqAna, provides a unique set of services that none of the studied facilities have. For example, SeqAna provides a multiple sequence alignment scoring and ranking service. This service, the only one of its kind, allows SeqAna's users to perform multiple sequence alignment with several alignment tools and rank the results of these alignments in order of quality. With this service, SeqAna's users will be able to identify which alignment tools are more appropriate for their specific set of sequences. In addition, SeqAna's users can customize a small alignment sample as a reference for SeqAna to automatically identify the best tool to align their large set of sequences.
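One way to picture a scoring and ranking service of this kind is with a sum-of-pairs score. The abstract does not say which quality measure SeqAna uses, so the sketch below is an assumption: the scoring values, the function names, and the tool labels are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def sum_of_pairs(alignment: List[str], match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps facing each other are not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def rank_alignments(candidates: Dict[str, List[str]]) -> List[str]:
    """Return tool names ordered from the highest to the lowest sum-of-pairs score."""
    return sorted(candidates, key=lambda name: sum_of_pairs(candidates[name]), reverse=True)


# Two hypothetical alignments of the same three sequences (ACGT, AGT, ACGT)
CANDIDATES = {
    "tool_a": ["ACGT", "A-GT", "ACGT"],
    "tool_b": ["ACGT", "AGT-", "ACGT"],
}
RANKING = rank_alignments(CANDIDATES)
```

A production service would more likely use a substitution matrix and affine gap penalties, or compare against a reference alignment, but the ranking step would look the same.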
Funding (Web session clustering paper): Supported by the Foundation of Hubei Key Technology Research and Development (2005AA101C18) and the Natural Science Foundation of South-Central University for Nationalities (YZY06009).
Funding (Artificial Fish Swarm MSA paper): The authors extend their appreciation to the Deanship of Scientific Research at Jouf University for funding this work through research Grant No. DSR2020–01–414.
Funding (Soybean mosaic virus P3 gene paper): Supported by the National Natural Science Foundation of China (30671266, 31101164), the National Basic Research Program of China (2006CB101708, 2009CB118404), the National 863 Program of China (2006AA100104), the 111 Project from the Ministry of Education of China (B08025), and the Youth Science and Technology Innovation Foundation of Nanjing Agriculture University, China (KJ2010002).
Funding: Deanship of Scientific Research (DSR), King Abdulaziz University, Grant/Award Number: D-139-137-1441.
Abstract: Owing to advances in current technology, molecular databases have grown exponentially, calling for faster and more efficient methods that can handle such huge amounts of data. Therefore, multi-processing CPU technology, including physical and logical processors (Hyper Threading), can be used to significantly increase computational performance. Sequence comparison and pairwise alignment both contribute significantly to calculating the resemblance between sequences for constructing optimal alignments. This research used the Hash Table-NGram-Hirschberg (HT-NGH) algorithm to perform this pairwise alignment using hashing capabilities. The authors propose using a parallel shared-memory architecture via Hyper Threading to improve the performance of pairwise alignment on protein molecular datasets. The proposed parallel hyper-threading method transforms HT-NGH by decomposing the datasets at the sequence level so that the processing units are used efficiently, that is, with fewer idle processing units. The authors combined hyper threading with multicore processing on shared memory, reporting an average performance gain of 24.8% and a highest boosting rate of 34.4%. The improvement preserves acceptable accuracy and reaches speed-ups of 2.08, 2.88, and 3.87, with efficiencies of 1.04, 0.96, and 0.97, using 2, 3, and 4 cores, respectively.
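The efficiencies reported in this abstract are consistent with the usual definition of parallel efficiency as speed-up divided by the number of cores. The abstract does not state the formula, so that definition is an assumption; the short check below simply recomputes the three values from the reported speed-ups.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency, assumed here to be speedup / cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Speed-ups quoted in the abstract, keyed by core count.
reported_speedups = {2: 2.08, 3: 2.88, 4: 3.87}

efficiencies = {
    cores: parallel_efficiency(speedup, cores)
    for cores, speedup in reported_speedups.items()
}

if __name__ == "__main__":
    for cores, eff in efficiencies.items():
        print(f"{cores} cores: efficiency {eff:.2f}")
```

Under this definition an efficiency above 1, as on 2 cores, corresponds to a super-linear speed-up.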
Abstract: Existing studies have challenged the current definition of named bacterial species, especially in the case of highly recombinogenic bacteria. This has led to considering the use of computational procedures to examine potential bacterial clusters that are not identified by species naming. This paper describes the use of sequence data obtained from MLST databases as input for a k-means algorithm extended to deal with housekeeping gene sequences as a metric of similarity for the clustering process. An implementation of the k-means algorithm has been developed based on an existing source code implementation, and it has been evaluated against MLST data. The results point to potential bacterial clusters that are close to more than one named species and thus may become candidates for alternative classifications accounting for genotypic information. The use of hierarchical clustering with sequence comparison as the similarity metric has the potential to find clusters different from named species by using a more informed cluster formation strategy than a conventional nominal variant of the algorithm.
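The abstract does not describe how k-means was extended to sequences. One common way to cluster strings with a k-means-style loop is k-medoids, where each cluster centre is an actual sequence and distance is a sequence comparison. The sketch below uses that substitution with Hamming distance on equal-length sequences; the function names, the deterministic initialisation, and the toy data are all assumptions, not the authors' method.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(seqs, k, max_iter=100):
    """k-means-style clustering of sequences using medoids as centres.

    Assumption: the first k sequences seed the clusters (deterministic,
    for illustration only). Returns (labels, medoid_indices).
    """
    medoids = list(range(k))
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sequence joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: hamming(s, seqs[medoids[c]]))
            for s in seqs
        ]
        # Update step: the new medoid minimises total distance to members.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(hamming(seqs[i], seqs[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


if __name__ == "__main__":
    toy = ["AAAA", "CCCC", "AAAT", "CCCG", "AATA", "CCGC"]
    print(k_medoids(toy, k=2))
```

Using medoids avoids having to define an "average" sequence, which plain k-means would require.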
Funding: Supported by the National Key R&D Program of China (2018YFA0903200) and the Science Technology and Innovation Commission of Shenzhen Municipality of China (ZDSYS 20200811142605017). It was also supported by the Innovation Program of the Chinese Academy of Agricultural Sciences and the Elite Young Scientists Program of CAAS.
Abstract: Creating a multi-gene alignment matrix for phylogenetic analysis using organelle genomes involves aligning single-gene datasets manually, a process that can be time-consuming and prone to errors. The HomBlocks pipeline was created to eliminate the inaccuracies arising from manual operations. Processing a large number of sequences, however, remains time-consuming. To overcome this challenge, we develop a fast and efficient method called Organelle Genomes for Phylogenetic Analysis (ORPA). ORPA can quickly generate multiple sequence alignments for whole-genome comparisons by parsing the result files of NCBI BLAST, completing the task in just 1 min. With increasing data volume, the efficiency of ORPA is even more pronounced: it is over 300 times faster than HomBlocks in aligning 60 higher-plant chloroplast genomes. The phylogenetic tree outputs from ORPA are equivalent to those from HomBlocks, indicating its outstanding efficiency. Due to its speed and accuracy, ORPA can identify species-level evolutionary conflicts, providing valuable insights into evolutionary cognition.
Funding: Supported by the National Key Technology Research and Development Program (2006BAD01A1702).
Abstract: [Objective] The aim was to identify genetic variation in Citrus sinensis (sweet orange) germplasm from Hunan Province using Start Codon Targeted (SCoT) polymorphism. [Method] The reaction system for SCoT amplification from sweet orange was first optimized, and then the SCoT fragments were amplified from 24 sweet orange cultivars collected in Hunan Province and sequenced for genetic variation analysis. [Result] The optimum reaction system for SCoT marker amplification was 2.0 μl containing 80 ng of template DNA, 0.3 mmol/L dNTPs, 0.2 μmol/L primer, 1.6 mmol/L Mg2+, 1.6 U of Taq DNA polymerase and 10×PCR buffer. With this reaction system, the PCR products from the sweet orange cultivars produced clear and reproducible bands at 100-2 000 bp in electrophoresis. The SCoT fragments of the 24 sweet orange cultivars were 1 090-1 091 bp long and shared 99.84% homology, differing by nucleotide deletions and substitutions. After sequencing, the SCoT polymorphisms could distinguish 12 sweet orange cultivars. In addition, the BLAST result showed that part of the coding region of the SCoT fragments shared high homology with the ribosomal protein S3 N superfamily. [Conclusion] This study will provide a theoretical basis for breeding sweet orange cultivars.
Abstract: [Objective] The molecular weight, isoelectric point, signal peptide, domains and other properties of the proteins encoded by known cystatin genes were analyzed. [Method] Cystatin genes were searched in NCBI and the related amino acid sequences were downloaded. SMART software was used to predict the domains. The SignalP program was used to search for signal peptides. The TMHMM program was used to search for and predict transmembrane domains. The CLUSTAL W program was used to perform multiple sequence alignment. Using MEGA3.1 software,...
Funding: The National Natural Science Foundation of China, No. 39830440.
Abstract: AIM: To isolate a novel isoform of human HPO (HPO-205) from human fetal liver Marathon-ready cDNA and characterize its primary biological function. METHODS: 5'-RACE (rapid amplification of cDNA 5' ends) was used to isolate a novel isoform of hHPO. The constructed pcDNA(HPO-205), pcDNA(HPO) and pcDNA eukaryotic expression vectors were each transfected by the lipofectamine method, and the stimulation of DNA synthesis was observed by ³H-TdR incorporation assay. Proteins extracted from different cells were analyzed by Western blot. RESULTS: A novel isoform of hHPO (HPO-205), encoding a 205 amino acid ORF corresponding to a translated product of 23 kDa, was isolated and distinguished from the previous HPO that lacked the N-terminal 80 amino acids. The dose-dependent stimulation of DNA synthesis in HepG2 hepatoma cells by HPO-205 demonstrated that its biological activity in vitro is similar to that of HPO. The level of MAPK (mitogen-activated protein kinase) phosphorylation revealed by Western blot analysis suggested that HPO-205 might stimulate hepatic cell proliferation more strongly than HPO. CONCLUSION: A novel isoform of hHPO (HPO-205) was isolated from hepatic-derived cells. The comparison of HPO-205 and HPO will lead to new insight into the structure and function of hHPO and provide a new way of thinking about the biological roles of HPO/ALR.
Abstract: Profile hidden Markov models (HMMs) based on classical HMMs have been widely applied for protein sequence identification. The forward and backward variables in profile HMMs are formulated under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thereby further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since fuzzy sets can handle uncertainties better than classical methods.
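For orientation, the classical forward variable that the fuzzy model generalizes can be sketched as follows. This is the standard probabilistic forward recursion on a small generic HMM, not the fuzzy (Sugeno/Choquet) version proposed in the abstract and not a full profile HMM with match, insert, and delete states; the two-state toy model and all parameter values are made up for illustration.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Classical forward algorithm: returns P(obs | model).

    alpha[s] holds the probability of the observations seen so far
    with the chain currently in state s.
    """
    # Initialisation with the first observation.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recursion: sum over predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    # Termination: total probability over all final states.
    return sum(alpha.values())


# Hypothetical two-state model over a two-letter alphabet.
STATES = ["M", "I"]
START = {"M": 0.8, "I": 0.2}
TRANS = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.4, "I": 0.6}}
EMIT = {"M": {"A": 0.7, "B": 0.3}, "I": {"A": 0.5, "B": 0.5}}

if __name__ == "__main__":
    print(forward("AAB", STATES, START, TRANS, EMIT))
```

The sums and products in the recursion are where the independence assumption enters; the fuzzy variant described in the abstract replaces them with fuzzy-measure-based operations.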
Funding: This work was supported by the National Natural Science Foundation of China (No. 20173023 and No. 90203012) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20020730006).
Abstract: We analyze for the first time the rules of breaking in an X-palindrome between human and chimpanzee. The results indicate that although fewer overall changes occurred in the human X-palindrome than in the chimpanzee, mutations occurring between the left arm and right arm were nearly equivalent in human and chimpanzee when compared with orangutan, which implies evolutionary synchronization. However, there are many more A/T→G/C changes than G/C→A/T changes in a single arm, which would lead to an increasing trend in GC content and suggests that the composition is not at equilibrium. In addition, it is remarkable that there are many more asymmetrical nucleotide changes between the two arms of the human palindrome than of the chimpanzee palindrome, and these mutations are prone to occur between bases with similar chemical structures. The symmetry seems higher in the chimpanzee palindrome than in the human X-palindrome.
Funding: Supported by the National Key R&D Program of China (2019YFD1001800), the China Agriculture Research System, the Overseas Expertise Introduction Project for Discipline Innovation (B18044), the China Agriculture Research System of MOF and MARA (CARS-26-05B), the Natural Science Foundation of Chongqing, China (cstc2019jcyj-msxmX0557), and the Guangxi Natural Science Foundation, China (2018GXNSFBA050027).
Abstract: In 2009, an emerging citrus viral disease caused by Citrus chlorotic dwarf-associated virus (CCDaV) was discovered in Yunnan Province of China. However, the occurrence and spread of CCDaV in other citrus-growing provinces in China have remained unknown to date. To better understand the distribution and molecular diversity of CCDaV in China, a total of 1 772 citrus samples were collected from 11 major citrus-growing provinces and tested for CCDaV by PCR. Among these, 134 citrus samples from Guangxi, Yunnan and Guangdong tested positive for CCDaV, demonstrating that the occurrence and spread of CCDaV are increasing in China. The complete genomes of 17 CCDaV isolates from different provinces and hosts were sequenced. Comparisons of the whole-genome sequences of the 17 CCDaV isolates and the 15 isolates available in GenBank revealed a sequence identity of about 99–100%, showing that the CCDaV isolates were highly conserved. Phylogenetic studies showed that the 32 CCDaV isolates belonged to four different groups based on geographical origins and host species, and that CCDaV isolates from China and Turkey were clustered into different groups. The results provide important information for clarifying the distribution and genetic diversity of CCDaV in China.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60372040, the Knowledge Innovative Project of the Chinese Academy of Sciences under Grant No. KSCX2-SW-233, and the 863 Grid Node of Hong Kong University under Grant No. 2002AA104530. Acknowledgements: We would like to thank the anonymous reviewers for their suggestions on how to improve this paper. The experimental data sets were provided by the Beijing Genomics Institute, Chinese Academy of Sciences.
Abstract: MegaBlast is one of the most important programs in the NCBI BLAST (Basic Local Alignment Search Tool) toolkit. However, MegaBlast is computation and I/O intensive. It consumes a great deal of memory, proportional to the product of the sizes of the query sequence set and the subject (database) sequence set. This paper proposes a new strategy for optimizing MegaBlast. The new strategy exchanges the query and subject sequence sets and builds a hash table based on the new subject sequences. It overlaps I/O with computation, shortens the overall time, and reduces the memory cost, since the memory required is proportional only to the size of the subject sequence set. The optimized algorithm is suitable for parallelization on cluster systems. The parallel algorithm uses the query segmentation method. As our experiments show, the parallel program, which is implemented with MPI, scales well.
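The idea of building a hash table over one sequence set and scanning the other set against it can be illustrated with a generic k-mer index. This is only a sketch of the general seed-lookup technique, not MegaBlast's actual data structures; the function names, the word size `k`, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(seqs, k):
    """Hash table mapping each k-mer to the (sequence id, offset) pairs
    where it occurs in the indexed set."""
    index = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((sid, pos))
    return index


def find_seeds(scan_seq, index, k):
    """Scan one sequence against the index and return exact k-mer hits
    as (offset in scanned sequence, indexed sequence id, offset there)."""
    hits = []
    for qpos in range(len(scan_seq) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for sid, spos in index.get(scan_seq[qpos:qpos + k], []):
            hits.append((qpos, sid, spos))
    return hits


if __name__ == "__main__":
    idx = build_index(["ACGTAC", "TTACGT"], k=4)
    print(find_seeds("GACGTT", idx, k=4))
```

In such a scheme the memory footprint is governed by whichever set is indexed, which is why the choice of the set to hash matters for the memory cost discussed in the abstract.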
Abstract: A fundamental goal in cellular signaling is to understand allosteric communication, the process by which signals originating at one site in a protein propagate reliably to affect distant functional sites. The general principles of protein structure that underlie this process remain unknown. Statistical coupling analysis (SCA) is a statistical technique that uses evolutionary data from a protein family to measure correlation between distant functional sites and to suggest allosteric communication. In proteins, very distant and small interactions between collections of amino acids provide the communication, which can be important for the signaling process. In this paper, we present the SCA of a protein alignment of the esterase family (pfam ID: PF00756), which contains the sequence of antigen 85C secreted by Mycobacterium tuberculosis, to identify a subset of interacting residues. Clustering analysis of the pairwise correlation highlighted seven important residue positions in the esterase family alignments. These residues were then mapped onto the crystal structure of antigen 85C (PDB ID: 1DQZ). The mapping revealed correlation between 3 distant residues (Asp38, Leu123 and Met125) and suggests allosteric communication between them. This information can be used in developing a new drug against this fatal disease.
Abstract: MCM10 protein is an essential replication factor involved in the initiation of DNA replication. An mcm10 mutant (mcm10-1) of budding yeast shows growth arrest at 37°C. In the present work, we have isolated an mcm10-1 suppressor strain, which grows at 37°C. Interestingly, this mcm10-1 suppressor undergoes cell cycle arrest at 14°C. A novel gene, YLR003c, was identified by high-copy complementation of this suppressor. We named it Cms1 (Complementation of Mcm10 Suppressor). Furthermore, transformation experiments show that cells of the mcm10-1 suppressor carrying a high-copy plasmid, but not a low-copy plasmid, grow at 14°C, indicating that overexpression of Cms1 can rescue the growth arrest of this mcm10 suppressor at the non-permissive temperature. These results suggest that CMS1 protein may functionally interact with MCM10 protein and play a role in the regulation of DNA replication and cell cycle control.
Funding: This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1C1B5084424) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2019R1A6A1A03032119).
Abstract: Advancements in next-generation sequencer (NGS) platforms have improved NGS sequence data production and reduced the cost involved, which has resulted in the production of a large amount of genome data. The downstream analysis of multiple associated sequences has become a bottleneck for the growing genomic data due to storage and space utilization issues in the domain of bioinformatics. Traditional string-matching algorithms are efficient for small data sequences but cannot process large amounts of data for downstream analysis. This study proposes a novel bit-parallelism algorithm called BitmapAligner to overcome the issues caused by a large number of sequences and to improve the speed and quality of multiple sequence alignment (MSA). The input files (sequences) tested with BitmapAligner can be easily managed and organized using the Hadoop distributed file system. The proposed aligner converts the test file (the whole genome sequence) into binaries of a length equal to that of the sequence, line by line, before the sequence alignment processing. The Hadoop distributed file system splits larger files into blocks based on a defined block size, which is 128 MB by default. BitmapAligner can accurately process the sequence alignment using the bitmask approach on large-scale sequences after sorting the data. The experimental results indicate that BitmapAligner operates in real time with a large number of sequences. Moreover, BitmapAligner finds the exact start and end positions of the pattern sequence when the MSA application is tested on the whole-genome query sequence. The MSA's accuracy is verified by the bitmask indexing property of the bit-parallelism extended shifts (BXS) algorithm. The dynamic and exact approach of the BXS algorithm is implemented through the MapReduce function of Apache Hadoop. Conversely, the traditional seed-and-extend approach faces the risk of errors while identifying the positions of the pattern sequences. Moreover, the proposed model resolves the large-scale data challenges that are covered through MapReduce in the Hadoop framework. Hive, Yarn, HBase, Cassandra, and many other pertinent flavors are to be used in the future for data structuring and annotations on the top layer of Hadoop, since Hadoop is primarily used for data organization and handles text documents.
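The abstract gives no details of the BXS algorithm, so the sketch below shows the classic Shift-And algorithm instead, as a minimal example of how bit-parallel matching with per-character bitmasks can report exact start and end positions of a pattern. It is a plain single-machine illustration with no Hadoop or MapReduce component, and the function name and toy input are hypothetical.

```python
def shift_and(pattern: str, text: str):
    """Classic Shift-And exact matching.

    Bit i of `state` is set when pattern[:i + 1] matches the text ending
    at the current position. Returns (start, end) index pairs, inclusive.
    """
    m = len(pattern)
    if m == 0:
        return []
    # Bitmask per character: bit i is set where pattern[i] == character.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit marking a full-pattern match
    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character, start a new one,
        # then keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append((j - m + 1, j))
    return hits


if __name__ == "__main__":
    print(shift_and("ACA", "GACACAT"))
```

Each text character costs a constant number of word operations when the pattern fits in a machine word, which is the source of the speed of bit-parallel methods.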
Abstract: There are many web-based multiple sequence alignment services accessible around the world. However, many researchers working on biological sequence analysis still struggle with multiple sequence alignment software that is inefficient, has an unfriendly user interface, and offers limited capability. In this study, we provide a comprehensive survey of regional and continental facilities that provide web-based alignment services. We also analyze and identify much-needed services that are not available through these existing service providers. We then implement a web-based model to address these needs. From that perspective, our web-based multiple sequence alignment server, SeqAna, provides a unique set of services that none of the studied facilities have. For example, SeqAna provides a multiple sequence alignment scoring and ranking service. This service, the only one of its kind, allows SeqAna's users to perform multiple sequence alignment with several alignment tools and rank the results of these alignments in order of quality. With this service, SeqAna's users will be able to identify which alignment tools are more appropriate for their specific set of sequences. In addition, SeqAna's users can customize a small alignment sample as a reference for SeqAna to automatically identify the best tool to align their large set of sequences.
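The abstract does not say which quality score SeqAna uses. As an illustration of what scoring and ranking alignments from several tools could look like, the sketch below applies a simple sum-of-pairs score with made-up match, mismatch, and gap values; the scoring scheme, the function names, and the tool names `tool_a` and `tool_b` are all assumptions.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows.

    Assumption: a flat match/mismatch/gap scheme; gap-gap pairs score 0.
    """
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


def rank_alignments(candidates):
    """Return tool names ordered from best to worst sum-of-pairs score."""
    return sorted(
        candidates,
        key=lambda name: sum_of_pairs(candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = {
        "tool_a": ["ACGT", "ACGT", "AC-T"],
        "tool_b": ["ACGT", "ACGT", "A-CT"],
    }
    print(rank_alignments(candidates))
```

A production service would more likely use a substitution matrix and affine gap penalties, or compare against a trusted reference alignment as the last sentence of the abstract suggests.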