Abstract
How a protein's amino acid sequence determines its three-dimensional structure is one of the central questions in the life sciences. A fold type reflects the topological pattern of a protein's core structure, and fold recognition is an important part of protein sequence-structure research. We studied 36 fold types that cannot each be modeled by a single unified hidden Markov model (HMM) and that together account for 41.8% of the α, β, and α/β class proteins in the Astral 1.65 sequence database. For each fold type, samples with less than 25% sequence identity to one another were selected as the training set and grouped by hierarchical clustering, with root mean square deviation (RMSD) as the criterion, to generate fold subgroups. A profile hidden Markov model (profile-HMM) was then built for each subgroup from a structural alignment produced by the multiple structural alignment algorithm MUSTANG. On a test set of 9505 Astral 1.65 samples with less than 95% sequence identity, the 36 fold types reached an average sensitivity of 90%, an average specificity of 99%, and an average Matthews correlation coefficient (MCC) of 0.95. These results show that, for fold types with too many members to be covered by one unified model, subgroup modeling based on RMSD clustering achieves high recognition accuracy. The approach offers a new method for protein fold recognition and lays a foundation for further research.
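The RMSD-based subgrouping and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the linkage rule or the RMSD cutoff, so average linkage, the `cutoff` parameter, the function names, and all numbers below are assumptions chosen only for illustration.

```python
import math


def cluster_by_rmsd(rmsd, cutoff):
    """Agglomerative (hierarchical) clustering on a symmetric RMSD matrix.

    Assumption: average linkage, and merging stops once the two closest
    clusters are farther apart than `cutoff` (in the same units as rmsd).
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    clusters = [[i] for i in range(len(rmsd))]

    def average_distance(a, b):
        # Mean pairwise RMSD between members of cluster a and cluster b.
        return sum(rmsd[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, x, y = min(
            (average_distance(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


def recognition_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and Matthews correlation coefficient
    for one fold type, computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical pairwise RMSD values (in angstroms) for four domains of one fold.
toy_rmsd = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 6.5, 7.5],
    [6.0, 6.5, 0.0, 1.2],
    [7.0, 7.5, 1.2, 0.0],
]
subgroups = cluster_by_rmsd(toy_rmsd, cutoff=3.0)  # two subgroups: [0, 1] and [2, 3]

# Hypothetical counts for one fold type, not taken from the paper.
sens, spec, mcc = recognition_metrics(tp=90, fp=10, tn=990, fn=10)
```

In the workflow the abstract describes, each resulting subgroup would then be structurally aligned and used to build its own profile-HMM; that step depends on external tools and is not shown here.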
Source
Wuli Huaxue Xuebao (《物理化学学报》)
SCIE
CAS
CSCD
Peking University Core Journals
2009, No. 12, pp. 2558-2564 (7 pages)
Acta Physico-Chimica Sinica
Funding
Supported by the National Natural Science Foundation of China (30570427)
and the Beijing Natural Science Foundation (4092008).