
Analysis on n-gram statistics and linguistic features of whole genome protein sequences

Abstract: To analyze statistically the large number of genomic and proteomic sequences available for different organisms, n-grams were extracted from the whole-genome protein sequences of 20 organisms. Their linguistic features were analyzed with two tests developed for natural languages and symbolic sequences: the Zipf power law and Shannon entropy. Natural genome proteins and artificial genome proteins were compared, and several statistical features of the n-grams were identified. The results show that the n-grams of whole-genome protein sequences approximately follow the Zipf law when n is larger than 4; the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins; a simple unigram model can distinguish different organisms; and there are organism-specific usages of "phrases" in protein sequences. It is suggested that further detailed n-gram analysis of whole-genome protein sequences will yield a powerful model for mapping the relationship among protein sequence, structure and function.
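The abstract names three ingredients (n-gram extraction, Shannon n-gram entropy, and a Zipf rank-frequency test) but does not give the authors' implementation. The following is only a minimal sketch of how such statistics are commonly computed. It assumes overlapping n-grams, entropy in bits over the empirical n-gram distribution, and a plain least-squares fit in log-log space; the function names and the toy sequences are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams over a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        # Sequences shorter than n contribute nothing (empty range).
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (bits) of the empirical n-gram distribution.

    Assumption: the paper may normalize differently (e.g. per residue).
    """
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank).

    A slope near -1 is the classic Zipf signature. Needs at least two
    distinct n-grams, otherwise the fit is undefined.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical toy "proteome"; real input would be whole-genome protein sets.
    proteins = ["MKVLAAGIVGLK", "MKVLSAGIAGLR", "MSTNPKPQRKTK"]
    for n in (1, 2, 3):
        counts = ngram_counts(proteins, n)
        print(n, len(counts), round(shannon_entropy(counts), 3), round(zipf_slope(counts), 3))
```

To compare natural proteins with artificial ones in this sketch, one would run the same functions on shuffled or randomly generated sequences of the same length and amino-acid composition; how the paper built its artificial proteins is not stated in the abstract.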
Source: Journal of Harbin Institute of Technology (New Series) [哈尔滨工业大学学报(英文版)], indexed in EI and CAS, 2008, No. 5, pp. 694-698 (5 pages)
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60435020)
Keywords: n-gram statistics; protein sequence; Zipf law; protein; mung bean; genome; sequence

