Abstract
The structural and functional properties of proteins are encoded in their amino acid sequences. The rules governing the sequence-structure mapping are believed to constitute a secondary genetic code. Simplifying the amino acid alphabet reduces redundancy in protein sequences and helps to reveal these coding rules. The alphabet can be simplified on the basis of the monomeric features of amino acids, their pairwise interactions, and their similarity. Here, using only the sequence information of proteins, we construct an embedded representation of amino acids from the occurrence frequencies of their nearest-neighbor amino acids. Building on this representation, we propose a model that compresses the embedding by reconstructing the nearest-neighbor occurrence frequencies. We name this method AA2Vec. The experimental results show that one specific representation dimension (three dimensions) is markedly more robust than the others. The extracted information captures the physicochemical and biochemical properties of amino acids as well as the interactions between nearest-neighbor amino acids. Notably, the method is stable across sequence datasets (SCOPe) with different sequence identities. The method yields a minimal representation of amino acids, which may help to generate simplified representations of protein sequences and to build simplified models of proteins.
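The pipeline outlined in the abstract (count nearest-neighbor amino acid frequencies, then compress them to a low-dimensional embedding that can reconstruct those frequencies) can be illustrated with a minimal sketch. This is not the AA2Vec model itself, whose architecture is not given in this record: a truncated SVD is swapped in as a simple linear stand-in for the compression step, and the function names, the symmetric counting of adjacent residues, and the toy sequences are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def neighbor_frequencies(sequences):
    """Row-normalized counts of adjacent residue pairs (assumed definition
    of 'nearest-neighbor occurrence frequency'; counted symmetrically)."""
    counts = np.zeros((20, 20))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a in INDEX and b in INDEX:  # skip non-standard symbols
                counts[INDEX[a], INDEX[b]] += 1
                counts[INDEX[b], INDEX[a]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for absent residues
    return counts / row_sums


def compress(freq, dim=3):
    """Rank-`dim` embedding and reconstruction via truncated SVD.
    A linear stand-in, not the model proposed in the paper."""
    u, s, vt = np.linalg.svd(freq, full_matrices=False)
    embedding = u[:, :dim] * s[:dim]      # one `dim`-vector per amino acid
    reconstruction = embedding @ vt[:dim]  # approximate frequency matrix
    return embedding, reconstruction


# Hypothetical toy sequences, not taken from SCOPe.
toy_sequences = ["MKTAYIAKQR", "GSHMLEDPVW", "ACDEFGHIKLMNPQRSTVWY"]
freq = neighbor_frequencies(toy_sequences)
embedding, recon = compress(freq, dim=3)
error = np.linalg.norm(freq - recon)  # Frobenius reconstruction error
```

Repeating `compress` for several values of `dim` and comparing the reconstruction error is one simple way to probe how much information a given representation dimension retains; the robustness analysis reported in the abstract is a separate result of the paper and is not reproduced by this sketch.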
Authors
Zhang Xinpeng (张鑫鹏)
Wang Jun (王骏)
Wang Wei (王炜)
School of Physics, Nanjing University, Nanjing 210093, China
Source
Journal of Nanjing University (Natural Science)
CAS
CSCD
PKU Core (北大核心)
2022, No. 1, pp. 103-114 (12 pages)
Funding
National Natural Science Foundation of China (11774157, 11934008).