Abstract
Although the protein sequence-structure gap continues to widen due to the development of high-throughput sequencing tools, the protein structure universe appears to be nearing completion, with no proteins with novel structural folds deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues, which serve as structural "keywords" that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with approximately 400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy.
Funding
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61728211 and 61832019.