Abstract
To analyze statistically the large number of genomic and proteomic sequences now available for different organisms, n-grams were extracted from the whole-genome protein sequences of 20 organisms. Their linguistic features were analyzed with two tests developed for natural languages and symbolic sequences: the Zipf power law and Shannon entropy. Natural genome proteins were compared with artificial genome proteins, and several statistical features of n-grams were identified. The results show that (1) the n-grams of whole-genome protein sequences approximately follow the Zipf law when n is larger than 4; (2) the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins; (3) a simple unigram model can distinguish different organisms; and (4) protein sequences contain organism-specific usages of "phrases". It is suggested that further detailed n-gram analysis of whole-genome protein sequences could yield a powerful model for mapping the relationship among protein sequence, structure and function.
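The two tests named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngram_counts`, `shannon_entropy`, `zipf_slope`) and the toy sequences are hypothetical. It assumes that n-grams are counted as overlapping windows within each sequence, that entropy is taken over the empirical n-gram distribution, and that Zipf-like behavior is assessed by the least-squares slope of log frequency against log rank.

```python
from collections import Counter
import math


def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf signature. Requires at least
    two distinct n-grams.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sequences invented for illustration only (not data from the study).
toy_proteins = ["MKVLAAGIVG", "MKVLSAGIVA", "MKTLAAGLVG"]
unigrams = ngram_counts(toy_proteins, 1)
bigrams = ngram_counts(toy_proteins, 2)
print("unigram entropy (bits):", shannon_entropy(unigrams))
print("bigram entropy (bits):", shannon_entropy(bigrams))
print("bigram Zipf slope:", zipf_slope(bigrams))
```

On real proteomes the same functions would be applied per organism and per n, and the natural sequences compared against artificial ones of matching composition, which is the comparison the abstract describes.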
Funding
Sponsored by the National Natural Science Foundation of China (Grant No. 60435020)