This paper develops a new system that distinguishes promoter sequences from non-promoter sequences, based on a position weight matrix and a backpropagation neural network. The system performs well on both the training set and the test set, with a mean recognition rate as high as 99% on the training set and 97% on the test set. The experimental results demonstrate that the system recognizes both the promoter sequences it was trained on and promoter sequences it has not seen previously.
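As background for the position weight matrix mentioned above, here is a minimal sketch of how such a matrix can be built from aligned sites and used to score a candidate window. The function names, the pseudocount, and the uniform background are illustrative assumptions, not the paper's parameters, and the sketch does not cover the backpropagation network or how the paper combines the two.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Return a log-odds PWM: one dict per position, base -> log2(p / background)."""
    # Assumption: a uniform background unless the caller supplies one.
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for i in range(length):
        # Start from the pseudocount so unseen bases do not give log(0).
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the log-odds of the observed base at each position of the window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

A window score like this could serve as one input feature to a classifier; whether the paper uses it that way is not stated in the abstract.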
Transcription factor binding sites (TFBS) play key roles in gene expression and regulation. They are short sequence segments with a definite structure that are recognized specifically by their corresponding transcription factors. From a statistical viewpoint, candidate TFBS should differ markedly from segments formed by randomly combining nucleotides. This paper proposes a combined statistical model for finding over-represented short sequence segments in different kinds of data sets. When an over-represented short segment is described by a position weight matrix, the nucleotide distribution at most sites of the segment should be far from the background nucleotide distribution; the central idea of the approach is to search for such signals. The algorithm is tested on three data sets: the binding sites of the cyclic AMP receptor protein in E. coli; PlantProm DB, a non-redundant collection of proximal promoter sequences from different species; and the intergenic sequences of the whole E. coli genome. Although the complexity of the three data sets differs considerably, the results show that the model is rather general and sensible.
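The idea that most positions of a candidate segment should lie far from the background distribution can be illustrated with a small sketch. It measures the distance with per-position relative entropy (Kullback-Leibler divergence, in bits). That measure, the pseudocount, the threshold, and all function names are assumptions chosen for illustration; the abstract does not say which statistic the combined model uses.

```python
import math

BASES = "ACGT"

def column_frequencies(sites, pseudocount=0.5):
    """Per-position base frequencies of equal-length aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def relative_entropy(column, background):
    """KL divergence (bits) of one column's distribution from the background."""
    return sum(p * math.log2(p / background[b])
               for b, p in column.items() if p > 0)

def informative_positions(sites, background, threshold=0.5):
    """Indices of positions whose distribution is far from the background.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return [i for i, column in enumerate(column_frequencies(sites))
            if relative_entropy(column, background) >= threshold]
```

Under this sketch, a segment would count as a candidate signal when most of its positions appear in the returned list; a real search would also have to account for how often the segment occurs relative to chance.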
Funding: Project supported by the National Natural Science Foundation of China (Grant No 70671089) and the Key Important Project (No 10635040).