Random Forest for Classification of Thermophilic and Psychrophilic Proteins Based on Amino Acid Composition Distribution
Abstract: Previous reports indicate that extracting features from the amino acid composition distribution (AACD) can effectively improve classification accuracy. We used this method to extract features and applied random forest, a relatively new ensemble classifier, to discriminate thermophilic from psychrophilic proteins based on their primary structure. The models were evaluated by 10-fold cross-validation and by independent testing on a separate dataset. Accuracy was highest when the number of segments was 1, reaching 92.9% and 90.2%, respectively. This suggests that AACD feature extraction does not effectively improve the recognition accuracy of random forest, contrary to the earlier reports, although it did moderately improve accuracy when a support vector machine (SVM) was used as the classifier. When six new features were introduced, the overall accuracy of random forest improved to 93.2% and 92.2%, and the areas under the receiver operating characteristic (ROC) curve were 0.9771 and 0.9696, respectively. These results were better than those of other ensemble classifiers and comparable with those of SVM.
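The feature extraction and the ROC metric mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that AACD means splitting the sequence into near-equal segments and concatenating the 20-residue composition of each segment, so that one segment reduces to plain amino acid composition. The function names `aacd_features` and `roc_auc` are hypothetical, and the six additional features from the abstract are not specified in the source, so they are not modelled here.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aacd_features(sequence: str, n_segments: int = 1) -> List[float]:
    """Concatenate per-segment amino acid compositions.

    Assumed AACD definition: the segmentation rule below is an
    illustration, not taken from the paper.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be >= 1")
    # Keep only the 20 standard residues; anything else is ignored.
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if len(seq) < n_segments:
        raise ValueError("sequence is shorter than the number of segments")
    features: List[float] = []
    for k in range(n_segments):
        # Integer boundaries that split the sequence into near-equal parts.
        start = k * len(seq) // n_segments
        end = (k + 1) * len(seq) // n_segments
        segment = seq[start:end]
        features.extend(segment.count(aa) / len(segment) for aa in AMINO_ACIDS)
    return features


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example scores higher; tied pairs count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Each protein yields a vector of length 20 × `n_segments`. Such vectors could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`; the source does not say which implementation was used.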
Source: Chinese Journal of Biotechnology (《生物工程学报》), 2008, Issue 2, pp. 302-308 (7 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Basic Research Program of China ("973 Program", No. 2007CB707804); Natural Science Foundation of Fujian Province (No. 2007J0360).
Keywords: random forest; amino acid composition distribution; thermophilic and psychrophilic proteins; ROC curve
