Abstract. Background: DNA methylation influences gene expression patterns and thereby alters genetic function. Computational analysis of the methylation status of nucleotides can help explore the underlying causes of methylation. Results: We present a DNA sequence-based method, named the word composition encoding method, that analyzes the methylation status of CpG dinucleotides using 5-bp (5-mer) DNA fragments. The prediction accuracy is 75.16% when all 5-bp word compositions are used (4^5 = 1024 in total). Furthermore, the 5-bp DNA fragments (words) with the greatest impact on methylation status are identified by the mRMR (Maximum-Relevance-Minimum-Redundancy) feature selection method. As a result, 58 words are selected and used to build a compact predictor, which achieves 77.45% prediction accuracy. When the word composition encoding method is coupled with the feature selection strategy, the meaning of these words can be analyzed through their contribution to the prediction. Biological evidence in the literature supports the view that the DNA sequence surrounding a CpG dinucleotide affects its methylation. Conclusions: The main contribution of this paper is to identify and analyze the key DNA words, taken from the neighborhood of CpG dinucleotides, that induce DNA methylation.
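To make the 4^5 = 1024 feature count concrete, the following is a minimal sketch of a 5-mer word composition encoding in Python. The abstract does not specify how words are counted, so the overlapping sliding window, the normalization to frequencies, the skipping of ambiguous bases, and the names `WORDS`, `INDEX` and `word_composition` are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"
K = 5  # word length (5-mer), as stated in the abstract

# All 4**5 = 1024 possible 5-bp words, in a fixed lexicographic order.
WORDS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {word: i for i, word in enumerate(WORDS)}


def word_composition(seq):
    """Return a 1024-long vector of 5-mer frequencies for a DNA sequence.

    Assumptions (not taken from the paper): words are counted with an
    overlapping sliding window, windows containing non-ACGT characters
    are skipped, and counts are normalized to sum to 1.
    """
    seq = seq.upper()
    counts = [0] * len(WORDS)
    total = 0
    for start in range(len(seq) - K + 1):
        j = INDEX.get(seq[start:start + K])
        if j is not None:  # skip windows with N or other ambiguous bases
            counts[j] += 1
            total += 1
    if total == 0:
        # Sequence too short, or no unambiguous window: all-zero vector.
        return [0.0] * len(WORDS)
    return [c / total for c in counts]


# Hypothetical flanking sequence around a CpG site, for illustration only.
features = word_composition("ACGTACGTAC")
```

Each sequence flanking a CpG site would be mapped to one such vector. A feature selection step such as mRMR would then rank the 1024 positions and keep a small subset (58 words in the abstract) for the compact predictor.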