Abstract
Protein domains are important for the study of protein structure and function. Existing ab initio methods for predicting protein domains generally suffer from low accuracy and high resource consumption. To address these problems, a language-model-based method for predicting protein domain boundaries, DomTransformer, is proposed. The dataset is constructed jointly from the CATH protein structure classification database, data from the Critical Assessment of protein Structure Prediction (CASP) competitions, and a domain database built on AFDB (AlphaFold Protein Structure Database). The network model is built on the Transformer architecture with a sparse multi-head self-attention mechanism, and new features, namely contact number and domain-level MSA (domain multiple sequence alignment), are introduced. By predicting domain boundaries directly, the method addresses problems such as data imbalance. Results on an independent test set demonstrate the effectiveness of DomTransformer.
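The abstract names contact number as one of the input features but does not define it. A minimal sketch under a common convention follows: the contact number of a residue is the count of other residues whose representative atom lies within a distance cutoff. The function name `contact_numbers`, the 8 Å cutoff, and the `min_seq_sep` parameter are assumptions made for illustration and may differ from what DomTransformer actually uses.

```python
import math

def contact_numbers(coords, cutoff=8.0, min_seq_sep=1):
    """Count, for each residue, how many other residues lie within `cutoff`.

    coords      : list of (x, y, z) tuples, one representative atom per residue
                  (for example C-alpha); units are angstroms.
    cutoff      : distance threshold (assumed value, not taken from the paper).
    min_seq_sep : minimum sequence separation for a pair to be counted
                  (hypothetical parameter; 1 counts every distinct pair).
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # A contact is symmetric, so it counts for both residues.
                counts[i] += 1
                counts[j] += 1
    return counts

# Toy example: four pseudo-residues on a straight line, 5 angstroms apart.
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(contact_numbers(toy))  # prints [1, 2, 2, 1]
```

Computed this way, the feature requires known coordinates, so it would serve as a training label or reference value. For a sequence-only predictor the contact number would itself have to be predicted.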
Authors
ZHANG Guijun; WANG Qianliang; PENG Chunxiang (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China)
Source
Journal of Zhejiang University of Technology
CAS
PKU Core (Peking University core journals list)
2024, No. 5, pp. 521-529 (9 pages)
Funding
National Key Research and Development Program of China (2019YFE0126100)
National Natural Science Foundation of China (62173304)
Keywords
protein domain
language model
ab initio prediction