Abstract: Structural genomics (SG) is an international effort that aims to solve the three-dimensional structures of important biological macromolecules, with a primary focus on proteins. One of the main bottlenecks in SG is the ability to produce diffraction-quality crystals for X-ray crystallography-based protein structure determination. SG pipelines allow some flexibility in target selection, which motivates the development of in silico methods for sequence-based prediction and assessment of protein crystallization propensity. We review the existing SG databanks that are used to derive these predictive models, and we discuss analytical results concerning protein sequence properties that were found to correlate with the ability to form crystals. We also contrast and empirically compare modern sequence-based predictors of crystallization propensity, including OB-Score, ParCrys, XtalPred and CRYSTALP2. Our analysis shows that these methods provide useful and complementary predictions. Although their average accuracy is similar, at around 70%, we show that a simple majority-vote ensemble improves accuracy to almost 74%. The best improvements are achieved by combining XtalPred with CRYSTALP2, while the OB-Score and ParCrys methods overlap to a larger extent, although they still complement the other two predictors. We also demonstrate that 90% of the protein chains are correctly predicted by at least one of these methods, which suggests that more accurate ensembles could be built in the future. We believe that current protein crystallization propensity predictors could provide useful input for the target selection procedures used by the SG centers.
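The majority-vote ensemble and the "correct by at least one method" figure mentioned in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the binary labels are invented toy data (not the paper's results), the function names are hypothetical, and the tie-breaking rule for an even number of predictors is a guess, since the abstract does not say how ties are resolved.

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]], tie_value: int = 0) -> List[int]:
    """Combine binary predictions (1 = crystallizable, 0 = not) by majority vote.

    tie_value is returned when the votes split evenly; this rule is an
    assumption, not something stated in the abstract.
    """
    names = list(predictions)
    n_chains = len(predictions[names[0]])
    combined = []
    for i in range(n_chains):
        votes = sum(predictions[name][i] for name in names)
        if 2 * votes > len(names):
            combined.append(1)
        elif 2 * votes < len(names):
            combined.append(0)
        else:
            combined.append(tie_value)
    return combined


def accuracy(predicted: List[int], truth: List[int]) -> float:
    """Fraction of chains whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def oracle_coverage(predictions: Dict[str, List[int]], truth: List[int]) -> float:
    """Fraction of chains predicted correctly by at least one method."""
    hits = sum(
        any(predictions[name][i] == truth[i] for name in predictions)
        for i in range(len(truth))
    )
    return hits / len(truth)


# Invented toy labels for eight hypothetical protein chains.
toy_truth = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = {
    "OB-Score":  [1, 0, 0, 1, 0, 1, 1, 0],
    "ParCrys":   [1, 0, 0, 1, 1, 0, 1, 0],
    "XtalPred":  [1, 1, 1, 0, 0, 0, 1, 0],
    "CRYSTALP2": [0, 0, 1, 1, 0, 0, 0, 0],
}

ensemble = majority_vote(toy_predictions)
for name, labels in toy_predictions.items():
    print(f"{name}: accuracy {accuracy(labels, toy_truth):.3f}")
print(f"majority vote: accuracy {accuracy(ensemble, toy_truth):.3f}")
print(f"at least one correct: {oracle_coverage(toy_predictions, toy_truth):.3f}")
```

In this toy setup each predictor errs on different chains, so the vote outperforms every individual method, and the gap between the ensemble accuracy and the "at least one correct" fraction shows the headroom that a better combination rule could, in principle, exploit.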