Abstract
The identification and prediction of DNA-binding proteins play an important role in studying the life activities of organisms and understanding their underlying mechanisms. As the number of protein sequences grows rapidly, computational methods offer greater advantages over traditional experimental methods. Starting from the sequence information and structural information of proteins, this paper summarizes existing feature extraction methods for DNA-binding proteins. On the PDB1075 and PDB186 datasets, the XGBoost algorithm is used to compare and analyze nine protein sequence feature extraction methods. The results show that different feature extraction methods have their own advantages and disadvantages; among them, the Local_DPP method, which is based on the evolutionary information of protein sequences, achieves the best overall prediction performance.
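As a rough illustration of what a sequence-based feature extraction step can look like, the sketch below computes amino acid composition, a simple descriptor chosen here only as an assumption for illustration; the abstract does not say whether it is among the nine methods compared. The function name `amino_acid_composition` and the constant `AMINO_ACIDS` are hypothetical. Each protein sequence is mapped to a fixed-length numeric vector, which is the kind of input a classifier such as XGBoost would then be trained on.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order (assumed ordering for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of relative residue frequencies.

    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    An empty or fully non-standard sequence yields an all-zero vector.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy sequence, not taken from PDB1075 or PDB186.
    features = amino_acid_composition("MKRAAC")
    print(len(features), features[:3])
```

Evolution-based descriptors such as Local_DPP are built from position-specific scoring matrices rather than raw residue counts, so this sketch only conveys the general shape of the sequence-to-vector step.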
Source
Hans Journal of Computational Biology (《计算生物学》)
2020, Issue 2, pp. 21-30 (10 pages)
Keywords
DNA-Binding Protein
Feature Extraction
Sequence Information
Structure Information