Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis (cited by: 1)

Abstract: Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize the amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 human protein sequences was extracted from the SWISS-PROT database, and feature vectors were created to capture specific amino acid sequence characteristics. Compared with a Support Vector Machine, a Multi-layer Perceptron, and a Naive Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in predicting the subcellular localization of proteins from their amino acid sequences (average Precision = 0.88; average Sensitivity = 0.86). Furthermore, EDA graphics characterized essential features of the proteins in each compartment. For example, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
Source: Genomics, Proteomics & Bioinformatics (基因组蛋白质组与生物信息学报(英文版); indexed in SCIE, CAS, CSCD), 2006, Issue 2, pp. 120-133 (14 pages)
Keywords: subcellular localization, Machine Learning, Exploratory Data Analysis, Decision Tree
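The abstract mentions feature vectors built from amino acid sequence characteristics and reports per-compartment Precision and Sensitivity. A minimal sketch of both ideas follows. It is an illustration under stated assumptions, not the authors' pipeline: the three-way grouping of residues into hydrophobic, neutral, and polar classes is a commonly used partition chosen here for illustration (the abstract does not give the grouping the paper used), and the function names `group_proportions` and `precision_and_sensitivity` are hypothetical.

```python
from collections import Counter

# ASSUMPTION: a commonly used hydrophobicity-based partition of the 20
# standard amino acids. The abstract does not state the paper's grouping.
GROUPS = {
    "hydrophobic": set("CLVIMFW"),
    "neutral": set("GASTPHY"),
    "polar": set("RKEDQN"),
}


def group_proportions(sequence):
    """Return the fraction of residues in each assumed group.

    Characters outside the 20 standard amino acids (e.g. 'X') are ignored.
    """
    counts = Counter(sequence.upper())
    per_group = {
        name: sum(counts[aa] for aa in members)
        for name, members in GROUPS.items()
    }
    total = sum(per_group.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {name: n / total for name, n in per_group.items()}


def precision_and_sensitivity(true_labels, predicted_labels, compartment):
    """Per-compartment precision TP/(TP+FP) and sensitivity TP/(TP+FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == compartment and p == compartment for t, p in pairs)
    fp = sum(t != compartment and p == compartment for t, p in pairs)
    fn = sum(t == compartment and p != compartment for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy sequence and toy labels, invented purely for illustration.
    print(group_proportions("LLVVGARK"))
    truth = ["membrane", "membrane", "cytoplasm", "cytoplasm"]
    preds = ["membrane", "cytoplasm", "cytoplasm", "cytoplasm"]
    print(precision_and_sensitivity(truth, preds, "membrane"))
```

Proportions like these could serve as three entries of a longer feature vector. Averaging the per-compartment precision and sensitivity over all nine compartments would give summary figures of the kind quoted in the abstract.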
