Abstract
Many cellular processes involve interactions between specific DNA molecules and proteins, and these interactions are closely linked to the onset of many human diseases. To understand the molecular mechanism by which proteins bind DNA, it is important to determine which residues in a protein sequence bind to DNA. At present, however, accurately identifying the DNA-binding residues of a protein remains difficult. In this study, we use machine learning to predict the DNA-binding regions of disease-associated proteins, which lays the foundation for subsequent precise identification of binding sites. The datasets used to build the prediction models were drawn from the UniProt and PDB databases. We extracted the position-specific scoring matrix (PSSM) and the physicochemical indices of amino acids as features and applied the random forest algorithm with 5-fold cross-validation. With 103 physicochemical indices as features, the overall accuracy reached a maximum of 94%, and the precision, recall, and Matthews correlation coefficient (MCC) were 88%, 75%, and 0.78, respectively. These results indicate that the model identifies the DNA-binding regions of disease-associated proteins well.
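For readers unfamiliar with the reported evaluation measures, the following is a minimal sketch of how accuracy, precision, recall, and the Matthews correlation coefficient (MCC) are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the example are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' own evaluation code may differ.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return standard binary-classification metrics from confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not data from the study).
metrics = binary_metrics(tp=40, fp=10, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice when binding and non-binding classes are unevenly represented.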
Authors
冯永娥
孙鹏哲
FENG Yong'e; SUN Pengzhe (College of Science, Inner Mongolia Agricultural University, Hohhot 010018, China)
Source
《内蒙古农业大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2024, No. 1, pp. 57-62 (6 pages)
Journal of Inner Mongolia Agricultural University (Natural Science Edition)
Funding
National Natural Science Foundation of China (62262050)
National Natural Science Foundation of China, Special Program (62141204)